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ABSTRACT 

Efficient join processing has been a core algorithmic chal- 
lenge in relational databases for the better part of four decades. 
Recently Ngo, Porat, Re, and Rudra (PODS 2012) estab- 
lished join algorithms that have optimal running time for 
worst-case inputs. Worst-case measures can be misleading 
for some (or even the vast majority of) inputs. Instead, one 
would hope for instance optimality, e.g., an algorithm which 
is within some factor on every instance. In this work, we 
describe instance optimal join algorithms for acyclic queries 
(within polylog factors) when the data are stored as binary 
search trees. This result sheds new light on the complexity of 
the well-studied problem of evaluating acyclic join queries. 

We also devise a novel join algorithm over higher dimen- 
sional index structures (dyadic trees) that may be exponen- 
tially more efficient than any join algorithm that uses only 
binary search trees. Further, we describe a pair of lower 
bound results that establish the following (1) Assuming the 
well-known 3SUM conjecture, our new index gives optimal 
runtime for certain class of queries. (2) Using a novel, un- 
conditional lower bound, i.e., that does not use unproven as- 
sumptions like P ^ NP, we show that no algorithm can use 
dyadic trees to perform bow-tie joins better than poly log 
factors. 

1. INTRODUCTION 

Efficient join processing has been a core algorithmic 
challenge in relational databases for the better part of 
four decades and is related to problems in constraint 
programming, artificial intelligence, discrete geometry, 
and model theory. Recently, some of the authors of this 
paper (with Porat) devised an algorithm with a run- 
ning time that is worst-case optimal (in data complex- 
ity) [13]; we refer to this algorithm as NPRR. Worst- 
case analysis gives valuable theoretical insight into the 
running time of algorithms, but its conclusions may be 
overly pessimistic. This latter belief is not new and 
researchers have focused on ways to get better "per- 
instance" results. 

The gold standard result is instance optimality. Tra- 
ditionally, such a result means that one proves a bound 
that is linear in the input and output size for every 



instance (ignoring polylog factors). This was, in fact, 
obtained for acyclic natural join queries by Yannakakis' 
classic algorithm 21 . However, we contend that this 



scenario may not accurately measure optimality for database 
query algorithms. In particular, in the result above the 
runtime includes the time to process the input. How- 
ever, in database systems, data is often pre-processed 
into indexes after which many queries are run using the 
same indexes. In such a scenario, it may make more 
sense to ignore the offline pre-processing cost, which is 
amortized over several queries. Instead, we might want 
to consider only the online cost of computing the join 
query given the indexes. This raises the intriguing pos- 
sibility that one might have sub -linear-time algorithms 
to compute queries. Consider the following example 
that shows how a little bit of precomputation (sorting) 
can change the algorithmic landscape: 



Example 1.1. Swppose one is given two sequences of 
integers A = {ai} i=1 such that a% < • • • < ajy and 



B = {bj}^ =1 such that b\ < ■ ■ ■ < . The goal is to 
construct the intersection of A and B. efficiently. 

Consider the case when = 2i and bj = 2j + 1. The 
intersection is disjoint, but any algorithm seems to need 
to ping-pong back and forth between A and B. Indeed, 
one can show that any algorithm needs Q(N) time. 

But what if a at <b\? In this case, An B — $ again, 
but the following algorithm runs in time 0(logiV): skip 
to the end of the first list, see that the intersection is 
empty, and then continue. This simple algorithm is es- 
sentially optimal for this instance (see Sec. 
precise statement). 

Worst-case analysis is not sensitive enough to de- 
tect the difference between the two examples above — a 
worst-case optimal algorithm could run in time il(N) 
on all intersections of size N and still be worst-case 
optimal. Further, note that the traditional instance op- 
timal run time would also be f2(iV) in both cases. Thus, 
both such algorithms may be exponentially slower than 
an instance optimal algorithm on some instances (such 
algorithms run in time N, while the optimal takes only 
log TV time). 



2.2 for a 
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In this work, we discover some settings where one can 
develop join algorithms that are instance optimal (up 
to polylog factors). In particular, we present such an 
algorithm for acyclic queries assuming data is stored 
in Binary Search Trees (henceforth BSTs), which may 
now run in sublinear time. Our second contribution is 
to show that using more sophisticated (yet natural and 
well-studied) indexes may result in instance optimal al- 
gorithms for some acyclic queries that are exponentially 
better than our first instance optimal algorithm (for 
BSTs). 

Our technical development starts with an observation 
made by Melhorn |13| and used more recently by De- 
maine, Lopez-Ortiz, and Munro [7j (henceforth DLM) 
about efficiently intersecting sorted lists. DLM describes 
a simple algorithm that allows one to adapt to the in- 
stance, which they show is instance optimal)]] 

One of DLM's ideas that we use in this work is how 
to derive a lower bound on the running time of any al- 
gorithm. Any algorithm for the intersection problem 
must, of course, generate the intersection output. In 
addition, any such algorithm must also prove (perhaps 
implicitly) that any element that the algorithm does not 
emit is not part of the output. In DLM's work and ours 
the format of such a proof is a set of propositional state- 
ments that make comparisons between elements of the 
input. For example, a proof may say 05 < 67 which 
is interpreted as saying, "the fifth element of A (a*,) is 
smaller than seventh element of B (67)" or "a 3 and b s 
are equal." The proof is valid in the sense that any in- 
stance that satisfies such a proof must have exactly the 
same intersection. DLM reasons about the size of this 
proof to derive lower bounds on the running time of any 
algorithm. We also use this technique in our work. 

Efficient list intersection and efficient join process- 
ing are intimately related. For example, R(A) XI S(A) 
computes the intersection between two sets that are en- 
coded as relations. Our first technical result is to extend 
DLM's result to handle hierarchical join queries, e.g., 

H n = Ri(Ai)MR 2 (A 1 ,A 2 ) X • • • IX R n (A\, . . . , A n ) 

when the relations are sorted in lexicographical order 
(BST indexes on A\,...,Ai for i — l,...,n). Intu- 
itively, solving H n is equivalent to a sequence of nested 
intersections. For such queries, we can use DLM's ideas 
to develop instance optimal algorithms (up to log N fac- 
tors where N — maxj— 1 ,...,7i |-/?i|). There are some mi- 
nor technical twists: we must be careful about how we 
represent intermediate results from these joins, and the 
book keeping is more involved than DLM's case. 

Of course, not all joins are hierarchical. The sim- 
plest example of a non-hierarchical query is the bow-tie 
query: 

R(A) X S(A,B) X T(B) 



We first consider the case when there is a single, tra- 
ditional BST index on S, say in lexicographic order A 
followed by B while R (resp. T) is sorted by A (resp. 
B). To compute the join R{A) txj S(A,B), we can use 
the hierarchical algorithm above. This process leaves 
us with a new problem: we have created sets indexed 
by different values for the attribute A, which we de- 
note U a = o- A=a (R(A) M S(A, B)) for each a e A. Our 
goal is to form the intersection U a D T(A) for each such 
a. This procedure performs the same intersection many 
times. Thus, one may wonder if it is possible to clev- 
erly arrange these intersections to reduce the overall 
running time. However, we show that while this clever 
rearrangement can happen, it affects the running time 
by at most a constant factor. 

We then extend this result to all acyclic queries un- 
der the assumption that the indexes are consistently 
ordered, by which we mean that there exists a total 
order on all attributes and the keys for the index for 
each relation are consistent with that order. Further, 
we assume the order of the attributes is also a reverse 
elimination order (REO), i.e., the order in which Yan- 
nakakis processes the query (For completeness, we recall 
the definition in Appendix D.5.2). There are two ideas 
to handle such queries: (1) we must proceed in round- 
robin manner through the joins between several joins 
between pairs of relations. We use this to argue that 
our algorithm generates at least one comparison that 
subsumes a unique comparison from the optimal proof 
in each iteration. And, (2) we must be able to efficiently 
infer which tuples should be omitted from the output 
from the proof that we have generated during execu- 
tion. Here, by efficient we mean that each inference can 
be performed in time poly log in the size of the data 
(and so in the size of the proof generated so far). These 
two statements allow us to show that our proposed al- 
gorithm is optimal to within a polylog factor that de- 
pends only on the query size. There are many delicate 
details that we need to handle to implement these two 



statements. (See Section 3.3 for more details 



We describe instances where our algorithm uses bi- 
nary trees to run exponentially faster than previous ap- 
proaches. We show that the runtime of our algorithm 
is never worse than Yannakakis' algorithm for acyclic 
join queries. We also show how to incorporate our algo- 
rithm into NPRR to speed up acyclic join processing for 
certain class of instances, while retaining its worst-case 
guarantee. We show in Appendix [G] that the resulting 
algorithm may also be faster than the recently proposed 



Leapfrog-join that improved and simplified NPRR 19 



1 This argument for two sets has been known since 1972 12 



Beyond BSTs. All of the above results use binary search 
trees to index the data. While these data structures are 
ubiquitous in modern database systems, from a theoret- 
ical perspective they may not be optimal for join pro- 
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cessing. This line of thought leads to the second set 
of results in our paper: Is there a pair of index struc- 
ture and algorithm that allows one to execute the bow-tie 
query more efficiently? 

We devise a novel algorithm that uses a common, 
index structure, a dyadic tree (or 2D-BST), that ad- 
mits 2D rectangular range queries [2 . The main idea 
is to use this index to support a lazy book keeping 
strategy that intuitively tracks "where to probe next." 
We show that this algorithm can perform exponentially 
better than approaches using traditional BSTs. We 
characterize an instance by the complexity of encoding 
the "holes" in the instance which measure roughly how 
many different items we have to prune along each axis. 
We show that our algorithm runs in time quadratic in 
the number of holes. It is straightforward from our 
results to establish that no algorithm can run faster 
than linear in the number of holes. But this lower 
bound leaves a potential quadratic gap. Assuming a 
widely believed conjecture in computational geometry 
(the 3SUM conjecture [17]), we are able to show an al- 
gorithm that is faster than quadratic in the number of 
holes is unlikely. We view these results as a first step 
toward stronger notions of optimality for join process- 
ing. 

We then ask a slightly refined question: can one use 
the 2D-BST index structure to perform joins substan- 
tially faster? Assuming the 3SUM conjecture, the an- 
swer is no. However, this is not the best one could hope 
for as 3SUM is an unproven conjecture. Instead, we 
demonstrate a geometric lower bound that is uncondi- 
tional in that the lower bound does not rely on such 
unproven conjectures. Thus, our algorithm uses the in- 
dex optimally. We then extend this result by showing 
matching upper and (unconditional lower bounds) for 
higher-arity analogs of the bow-tie query. 

2. BACKGROUND 

We give background on binary-search trees in one and 
two dimensions to define our notation. We then give a 
short background about the list intersection problem 
(our notation here follows DLM). 

2.1 Binary Search Trees 

In this section, we recap the definition of (ID and) 
2D-BST and record some of their properties that will 
be useful for us. 

One-Dimensional BST. We begin with some proper- 
ties of the one-dimensional BST, which would be useful 
later. Given a set U with N elements, the 1D-BST for 
U is a balanced binary tree with N leaves arranged in 
increasing order from left to right. Alternatively, let r 
be the root of the 1D-BST for U. Then the subtree 
rooted at the left child of r contains the smallest 



elements from U and the subtree rooted at the right 
child of r contains the [y] largest elements in U. The 
rest of the tree is defined in a similar recursive manner. 
For a given tree T and a node v in T, let T v denote the 
subtree of T rooted at v. Further, at each node v in the 
tree, we will maintain the smallest and largest numbers 
in the sub-tree rooted at it (and will denote them by £ v 
and r v respectively). Finally, at node v, we will store 
the value n v = \%\r\ 

The following claim is easy to see: 

PROPOSITION 2.1. The 1D-BST for N numbers can 
be computed in 0{N dog N) time. 

Lemma 2.2. Given any BST 1 T for the set U and any 
interval [£, r] one can represent [£, r] fl U with subset 
W of vertices of T of size \W\ < 0(log|Z7|) such that 
the intersection is at the leaves of the forest U v< £wTv 
Further, this set can be computed in 0(log \ U\) time. 

Remark 2.3. The proof of Lemma \2.2\ also implies 
that all intervals are disjoint. Further, the vertices are 
added to W in the sorted order of their £ ( and hence, 
r ) values. 

For future use, we record a notation: 

Definition 2.4. Given an interval I and a BST for 
T , we use W(I, T) to denote the W as defined in Lemma 2.2 

We will need the following lemma in our final result: 



Lemma 2.5. Let T be a 1D-BST for the set U and 
consider two intervals I\ 2 12- Further, define U\\2 — 
(1% \ I2) H U. Then one can traverse the leaves in T 
corresponding to E/u2 (and identify them) in time 

0((l^i\ 2 | + |^(/ 2 ,T)|)-log|[/|). 

Two-Dimensional BST. We now describe the data struc- 
ture that can be used to compute range queries on 2D 
data. Let us assume that U is a set of n pairs (x, y) of 
integers. The 2D-BST T is computed as follows. 

Let T x denote the BST on the x values of the points, 
vertex v, we will denote the interval of v in 7~ x as 
[£*, r%]. Then for every vertex v in T , we have a BST 
(denoted by T Y (v)) on the y values such that (x, y) & U 
and x appears on a leaf of T x (i.e. x G [£ v ,r v ]). If the 
same y value appears for more than one x such that 
x G [£„,?-„], then we also store the number of such y's 
on the leaves (and compute n v for the internal nodes so 
that it is the weighted sum of the values on the leaves). 

For example, consider the set U in Figure [T] Its 2D- 
BST is illustrated in Figure [4} 

We record the following simple lemma that follows 
immediately from Lemma |2.2| 



2 If the leaves are weighted then n v will be the sum of the 
weights of all leaves in T v . 
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Figure 1: A set U = [3] x [3] - {(2,2)} of eight 
points in two dimension. 



Lemma 2.6. Let v be a vertex in T x . Then given 
any interval I on the y values, one can compute whether 
there is any leaf in T Y (v) with value in I (as well as 
get a description of the intersection) in O(logiV) time. 

2.2 List Intersection Problem 

Given a collection of of n sets A\,..., A n , each pre- 
sented in sorted order as follows: 

A s = {A s [l}, . . . , A S [N S }} where A s [i] < A s [j] 

for all s and i < j. We want to output the intersection 
of n sets Ai, i = 1, 2, . . . ,n. 

To do that, DLM introduced the notion of an argu- 
ment. 

Definition 2.7. An argument is a finite set of sym- 
bolic equalities and inequalities, or comparisons, of the 
following forms: (1) (A s [i\ < A t [f\) or (2) A s [i] = A t [j] 
for i,j > 1 and s,i £ [n]. An instance satisfies an ar- 
gument if all the comparisons in the argument hold for 
that instance. 

Some arguments define their output (up to isomor- 
phism). Such arguments are interesting to us: 

Definition 2.8. An argument P is called a B-proof 
if any collection of sets A\, . . . ,A n that satisfy P, we 
have P|™=i = B , i.e., the intersection is exactly B . 

Lemma 2.9. An argument P is a B-proof for the in- 
tersection problem precisely if there are elements b\, 
. . . , b n for each b £ B, where bi is an element of Ai 
and has the same value as b, such that 

• for each b £ B, there is a tree on n vertices, every 
edge (i,j) of which satisfies (bi —bj) £ P; and 

• for consecutive values b, c £ B (J {+oo, — oo}, the 
subargument involving the following elements is a 
-proof for that subinstance: from each Ai, take 
the elements strictly between bi and Cj . 



Algorithm 1 Fewest- Comparisons For Sets 
Input: Ai in sorted order for i = 1, . , . , n. 
Output: The a smallest B-Proof where B = n" =1 Aj 

1: e max i= i,...,„ Ai[V\. 

2: While not done do 

3: Let e; be the largest value in Ai such that e; < e 

4: Let e'i be e^s immediate successor in Aj. 

5: If e'j does not exist break (done) 

6: Let io = argmax i=1 n e[. 

7: If e = e'i for every i — 1, . . . , n then 

8: emit e\ = e' i+1 for i = 1, . . . , n — 1. 

9: else 

10: emit e', < e. 

t o 

11: e <- e' lQ 



Proof. Suppose an argument B has those two prop- 
erties in the above lemma. The first property implies 
that for every b € B, all sets Ai also contains b. So 
the set B is the subset of the intersection of n sets Ai, 
1 < i < n. The second property implies that for any 
consecutive values b, c e B Li {+oo, — oo}, there exists 
no value x strictly between b and c such that all sets Ai 
contains x. In other words, the intersection of n sets Ai 
is the subset of B. So the argument P is a B-proof. □ 

It is not necessary that every argument P that is 
a B-proof has the 2 properties above. However, for 
any intersection set instance, there always exists a proof 
that has those properties. We describe these results in 
Appendix |B.2[ 

We describe how the list intersection analysis works, 
which we will leverage in later sections. First, we de- 
scribe an algorithm, Algorithm [l] that generates the 
fewest possible comparisons. We will then argue that 
this algorithm can be implemented and run in time pro- 
portional to the size of that proof. 

Theorem 2.10. For any given instance, Algorithm^ 
generates a proof for the intersection problem with the 
fewest number of comparisons possible. 

Proof. For simplicity, we will prove for the intersec- 
tion problem of 2 sets A and B. The case of n > 2 is 
very similar. Without loss of generality, suppose that 
A[l] < B[l]. If B[l] £ A then define i to be the max- 
imum number such that A[i] < B[l]. Then the com- 
parison (A[i] < B[l)) is the largest possible index and 
any proof needs to include at least this inequality. This 
is implemented above. If B[l] £ A then define i to be 
the index such that A[i] = B[l]. Then the comparison 
(A[i] = B[l]) should be included in the proof for the 
same reason. Inductively, we start again with the set 
A from (i + l) th element and set B from B[l], Thus, 
the Algorithm [l] generates a proof for the intersection 
problem with the fewest comparisons possible. □ 
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In Algorithm [T] there is only one line inside the while 
loop whose running time depends on the data set size: 
Line [3] requires that we search in the data set, but 
since set is sorted a binary search can perform this in 
O (log AT) time where N = maxj = x,...,n \Ai\. Thus, we 
have shown: 

Corollary 2.11. Using the notation above and given 
sets Ax, ... , A n in sorted order, let T> be the fewest num- 
ber of comparisons that are needed to compute B = 
(~)™ =1 Ai. Then, there is an algorithm to run in time 
O(nVlogN). 

Informally, this algorithm has a running time with 
optimal data complexity (up to log N factors) . 

3. INSTANCE OPTIMAL JOINS WITH TRA- 
DITIONAL BINARY SEARCH TREES 

In this section, we consider the case when every rela- 
tion is stored as a single binary search tree. We describe 
three results for increasingly broad classes of queries 
that achieve instance optimality up to a log N factor 
(where N is the size of the largest relation in the in- 
put). (1) A standard algorithm for what we call hi- 
erarchical queries, which are essentially nested intersec- 
tions; this result is a warmup that describes the method 
of proof for our lower bounds and style of argument in 
this section. (2) We describe an algorithm for the sim- 
plest non-hierarchical query that we call bow-tie queries 
(and will be studied in Section The key idea here is 
that one must be careful about representing the inter- 
mediate output size, and a result that allows us to show 
that solving one bow-tie query can be decomposed into 
several hierarchical queries with only a small blowup 
over the optimal proof size. (3) We describe our re- 
sults for acyclic join queries; this result combines the 
previous two results, but has a twist: in more com- 
plex queries, there are subtle inferences made based on 
inequalities. We give an algorithm to perform this in- 
ference efficiently. 

3.1 Warmup: Hierarchical Queries 

In this section, we consider join queries that we call 
hierarchical. We begin with an example to simplify our 
explanation and notation. We define the following fam- 
ily of queries; for each n > 1 define H n as follows 

H n =R 1 {A 1 )MR 2 {A 1 ,A 2 ) IX ■ ■ ■ X R n (Ai, . . . , A n ). 

We assume that all relations are sorted in lexicographic 
order by attribute. Thus, all tuples in i?j are totally 
ordered. We write Ri[k] to denote the k th tuple in Ri 
in order, e.g., Ri[l] is the first tuple in Ri. An argu- 
ment here is a set of symbolic comparisons of the form: 
(1) R s [i] < R s [j], which means that R s [i] comes before 
Rt[j] in dictionary order, or (2) R s [i] — Rt[j], which 



Algorithm 2 Fewest-Comparisons For Hierarchical 
Queries 

Input: A hierarchical query H n 
Output: A proof of the output of H n 

1: e = max^x,...,^ R% [1] // e is the maximum initial 
value. 

2: While not done do 

3: let ei be the largest tuple in A t s.t. e, < e 
4: let e- be the successor of for i = 1, . . . , n. 
5: If there is no such e'j then break (done) 
6: io <- argmaXj =li „ 

7: // NB: i = n if {ej}" =1 agree on all attributes 
8: If {e^}™ =1 agree on all attributes then 
9: emit e' n in H n and relevant equalities. 

10: e -s— the immediate successor of e 

11: else 

12: emit ei < e 

13: e<-«4. 



means that R s [i] and Rt[j] agree on the first k com- 
ponents where k = min{s,<}. The notion of £?-proof 
carries over immediately. 

Our first step is to provide an algorithm that pro- 
duces a proof with the fewest number of comparisons; 
we denote the number of comparisons in the smallest 
proof as T>. This algorithm will allow us to deduce a 
lower bound for any algorithm. Then, we show that we 
can compute H n in time 0(nD log N + \H n \) in which 
N = maXi = x ... n |-Rj|; this running time is data com- 
plexity optimal up to log A. The algorithm we use 
to demonstrate the lower bound argument is in Algo- 
rithm 

Proposition 3.1. For any given hierarchical join query 
instance, Algorithm [2] generates a proof that contains 
no more comparisons than the hierarchical join query 
problem with the fewest comparisons possible. 

Proof. We only prove that all emissions of the al- 
gorithm are necessary. Fix an output set of H n and 
call it O. At each step, the algorithm tries to set the 
eliminator, e, to the largest possible value. There are 2 
emissions to the output: (1) We only emit each tuple in 
the output once, since e is advanced on each iteration. 
Thus, each of these emissions is necessary. (2) Suppose 
that all e\ do not agree, then we need to emit some in- 
equality constraint. Notice that e — e[ for some i and 
that e.i is from a different relation than e : otherwise, 
e\ = e — if this were true for all relations we would get 
a contradiction to there being some e[ that disagrees. 
If we omit ei Q < e, then we could construct an instance 
that agrees with our proof but allows one to set e.; = e. 
However, if we do that for all values then we could get 
a new output tuple since this tuple would agree on a all 
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attributes, and this would no longer be a O-proof. □ 

Observe that in Algorithm [2j in each iteration, the 
only operation whose execution time depends on the 
dataset size is in Line |3j i.e., all other operations are 
constant or 0{n) time. Since each relation is sorted, 
this operation takes at most maxilogl^LI using binary 
search. So we immediately have the following corollary 
of an efficient algorithm. 

Corollary 3.2. Computing H n = Ri M ■ ■ ■ M R„ of 
the hierarchical query problem, where every relation Ri 
has i attributes A\, . . . , A4 and is sorted in that order. 
Denote N — max{|i?i|, I-R2I, • ■ ■ , |-Rn|} andT> be the size 
of the minimum proof of this instance. Then H n can be 
computed in time 0(nT> log N + \H n \). 

It is straightforward to extend this algorithm and 
analysis to the following class of queries: 

Definition 3.3. Any query Q with a single relation 
is hierarchical and if Q — R\ \A ■ ■ ■ X R n is hierarchical 
and R is any relation distinct from Rj for j = 1, . . . , n 
that contains all attributes of Q then Q' = R% X • • • [X 
R n 1X1 R is hierarchical. 

And one can show: 

Corollary 3.4. If Q is a hierarchical query on re- 
lations Ri , . . . , R n then there is an algorithm that runs 
in time O (nD log N + |Q|) where N = max i=lj _ n \Ri\. 

Thus, our algorithm's run time has data complexity 
that is optimal to within log N factors. 

3.2 One-index BST for the Bow- Tie Query 

The simplest example of a non- hierarchical query, and 
the query that we consider in this section, we call the 
bow-tie query: 

Qm = R(X) X S(X,Y) X T(Y). 

We consider the classical case in which there is a sin- 
gle, standard BST on S with keys in dictionary order. 
Without loss, we assume the index is ordered by X fol- 
lowed by Y. A straightforward way to process the bow- 
tie query in this setting is in two steps: (1) Compute 
S'(X, Y) = R(X) X S(X, Y) using the algorithm for hi- 
erarchical joins in the last section (with one twist) and 
(2) compute Si,(Y) X T(Y) by using the intersection 
algorithm for each x in which SL, = o-x= x (S). Notice 
that the data in S' is produced in the order X followed 
by Y. This algorithm is essentially the join algorithm 
implemented in every database modulo the small twist 
we describe below. In this subsection, we show that 
this algorithm is optimal up to a log N factor (where 
JV = max{|fi|,|S|,|T|}). 

The twist in (1) is that we do not materialize the out- 
put of 5"; this is in contrast to a traditional relational 



database. Instead, we use the list intersection algorithm 
to identify those x such that would appear in the out- 
put of R(x), S(x,y). Notice, the projection wx{S) is 
available in time |7Tjf (<?) | log time using the BST. 
Then, we retain only a pointer for each x into its BST, 
which gives us the values associated with x in sorted or- 
der^ This takes only time proportional to the number 
of matching elements in S (up to log \ S\ factors). 

The main technical obstacle is the analysis of step 
(2). One can view the problem in step (2) as equivalent 
to the following problem: We are given a set B in sorted 
order (mirroring T above) and m sets Y\ . . . , Y m . Our 
goal is to produce A{ = Y> n B for i = 1, . . . , m. The 
technical concern is that since we are repeatedly inter- 
secting each of the Yi sets, we could perhaps be smarter 
and cleverly intersect the Y^ lists to amortize part of the 
computation and thereby lower the total cost of these 
repeated intersections. Indeed, this can happen (as we 
illustrate in the proof); but we demonstrate that the 
overall running time will change by only a factor of at 
most 2. 

The first step is to describe an Algorithm [3] to pro- 
duce a proof of the contents of Ai that has the following 
property: if the optimal proof is of length D, Algo- 
rithm [3] produces a proof with 21? comparisons. More- 
over, all proofs produced by the algorithm compare only 
elements of Yi (for i = 1, . . . , m) with elements of B. 
We then argue that step (2) to produce each Ai inde- 
pendently runs in time 0(T> log N). For brevity, the 
algorithm description in Algorithm [3] assumes that the 
smallest element of B is smaller than any element of Yj 
for i = 1, . . . , m initially. In the appendix, we include a 
more complete pseudocode. 

Proposition 3.5. With the notation above, if the 
minimal sized proof contains V comparisons, then Algo- 
rithm^ emits at most 2D comparisons between elements 
of B and Yi for i = 1, . . . , m 

We perform the proof in two stages in the Appendix: 
The first step is to describe simple algorithm to generate 
the actual minimal-sized proof, which we use in the sec- 
ond step to convert that proof to one in which all com- 
parisons are between elements of Yj for j = 1, . . . , m and 
elements B. The minimal-sized proof may make com- 
parisons between elements of y £ Yi and y' £ Yj that 
allow it to be shorter than the proof generated above. 
For example, if we have s < l\ = u± < I2 = U2 < s' 
we can simply write s < l±, l\ < Z 2 , and I2 < s' with 
three comparisons. In contrast, Algorithm [3] would gen- 
erate four inequalities: s < li, s < I2, h < s', and 

3 Equivalently, in Line [9] of Alg. [2] we modify this to emit 
all tuples between e' n and where this is the largest tuple 
such that agrees with e' n _i and then update e accordingly. 
This operation can be done in time log of the gap between 
these tuples, which means it is sublinear in the output size. 
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Algorithm 3 Fewest-Comparisons 1BST 
Input: A set B and m sets Yi, . . . , Y m 
Output: Proof of B n Yi for i = 1, . . . , m 



1: Active = [to] // initially all sets are active. 
2: While Exists active element in B and Active ^ 
do 

3: lj 4— the min element in Yj forj € Active. 

4: s the max element, s < ij forall j 6 Active. 

5: s' «— be s's successor in 5* (if s' exists). 

6: If s' does not exist then 

7: For j € Active do 

Emit lj 6 s for 6 G {<, >, =}. 

else 
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Uj -s— the max element in Yj s.t. Uj < s'. 

For j € Active do 

Emit s 9 lj and Uj 9 s' for G {=, <}. 
Eliminate elements a € A/, s.t. a < Uj 
Remove j from Active, if necessary. 

Eliminate elements x G B s.t. a; < s' . 



I2 < s. To see that this slop is within a factor 2, one 
can always replace a comparison y < y' with a pair of 
comparisons y' 9 x' and x 9 y for G {<, =} where x 



(resp. 



is the maximum (resp. minimum) element 



in B less than y (resp. greater than) y' . As we ar- 
gued above, the pairwise intersection algorithm runs in 
time 0(T> log N), while the proof above says that any 
algorithm needs £1(2?) time. Thus, we have shown: 

Corollary 3.6. For the bow-tie query, Qx defined 
above, when each relation is stored in a single BST, 
there exists an algorithm that runs in time 0(riD log N+ 
\Q\) in which N — max{|i?|, \S\, \T\} and T> is the min- 
imum number of comparisons in any proof. 

Thus, for bow-tie queries with a single index we get 
instance optimal results up to poly log factors. 

3.3 Instance Optimal Acyclic Queries with Re- 
verse Elimination Order of Attributes 

We consider acyclic queries when each relation is stored 
in a BST that is consistently ordered, by which we mean 
that the keys for the index for each relation are con- 
sistent with the reverse elimination order of attributes 
(REO). Acyclic queries and the REO order are defined 
in Abiteboul et al. [l] Ch. 6.4], and we recap these defin- 
tions in Appendix |D. 5. 2[ 

In this setting, there is one additional complication 
(compared to Qm) that we must handle and that we 
illustrate by example. 

Example 3.1. Let Q 2 join the following relations: 

R(X) = [N] Si(X,Xi) — [N] x [N] 

S 2 (X 1 ,X 2 ) = {(2,2)} Tpf 2 ) = {l,3} 
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Figure 2: Illustration for the run of Algo- 
rithm 1 1 2 1 on the example from Example |3.1| for 
N = 4. The tuples are ordered from smallest at 
the bottom to largest at the top and the "probe 
tuple" t moves from bottom to top. The ini- 
tial constraints are X < 1 and X > 4 (due to 
#1), {X, Y) < (1, 1) and (X, Y) > (4, 4) (due to i? 2 ), 
(Y,Z) < (2,2) and (Y, Z) > (2,2) (due to R 3 ) and 
Z < 1 and Z > 3 (due to R4). Initial probe tuple t 
(denoted by the red dotted line) is (1, 2, 2). Then 
we have e\ = e[ = (l),e 2 = e' 2 = (l,2),e 3 = e' 3 — 
(2, 2),e4 = (3),e4 = (1). The only new constraint 
added is 1 < Z < 3. This advances the new probe 
tuple to (1,2,3) and is denoted by the blue dot- 
ted line. However, at this point the constraints 
(Y, Z) > (2, 2), (Y, Z) < (2, 2) and 1 < Z < 3 rule out 
all possible tuples and Algorithm |12| terminates. 



The output of Q2 is empty, and there is a short proof: 
T[l]X 2 < S 2 [l]X 2 andS 2 [X]X 2 < T[l]X 2 (this certi- 
fies that T 1X1 S is empty). Naively, a DFS-style search 
or any join of R IX S\ will take Vt(N) time; thus, we 
need to zero in on this pair of comparisons very quickly. 



In Appendix |C.2| we see that running the natural 
modification of Algorithm[3]does discover the inequality- 
but it forgets it after each loop! In general, we may infer 
from the set of comparisons that we can safely eliminate 
one or more of the current tuples that we are consider- 
ing. Naively, we could keep track of the entire proof that 
we have emitted so far, and on each lower bound com- 
putation ensure that takes into account all constraints. 
This would be expensive (as the proof may be as bigger 
than the input, and so the running time of this naive 
approach would be least quadratic in the proof size). A 
more efficient approach is to build a data structure that 
allows us to search the proof we have emitted efficiently. 

Before we talk about the data structure that lets us 
keep track of "ruled out" tuples, we mention the main 
idea behind our main algorithm in Algorithm |12| At 



any point of time, Algorithm 12 queries the constraint 
data structure, to obtain a tuple t that has not been 
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ruled out by the existing constraints. If for every i <E 
[rn], 7r a ttr(_Ri)(t) € Ri, then we have a valid output tu- 
ple. Otherwise, there exists a smallest C{ > 7r attr (fl.) (t) 
and a largest e[ < 7r a ttr(K i )('t) for some i € [to]. In 
other words, we have found a "gap" + 1, — 1]. We 
then add this constraint to our data structure. (This 
is an obvious generalization of DLM algorithm for set 
intersection.) The main obstacle is to prove that we 
can charge at least one of those inserted interval to a 
"fresh" comparison in the optimal proof. We would like 
to remark that we need to generate intervals other than 
those of the form mentioned above to be able to do this 
mapping correctly. Further, unlike in the case of set 
intersection, we have to handle the case of comparisons 
between tuples of the same relation where such com- 
parisons can dramatically shrink the size of the optimal 
proof. The details are deferred to the appendix. 

To convert the above argument into an overall algo- 
rithm that run in time near linear in the size of the 
optimal proof, we need to design a data structure that 
is efficient. We first make the observation that we can- 
not hope to achieve this for any query (under standard 
complexity assumptions). However, we are able to show 
that for acyclic queries, when the attributes are ordered 
according to a global ordering that is consistent with an 
REO, then we can efficiently maintain all such prefixed 
constraints in a data structure that performs the infer- 
ence in amortized time: 0(n2 3n log N), which is expo- 
nential in the size of the query, but takes only 0(log N) 
as measured by data complexity. 

Theorem 3.7. For an acyclic query Q with the con- 
sistent ordering of attributes being the reverse elimina- 
tion order (REO), one can compute its output in time 

0(V- f(n,m)- log TV + mn2 3n \Output\ log AT) 

where N = max{|i?,j| | i = 1, . . . , n} + T> where T> is 
the number of comparisons in the optimal proof, where 
f(n, m) = mn2 2n + n2 4 ™ and depends only on the size 
of the query and number of attributes. 

A complete pseudo code for both the algorithm and 
data structure appears in Appendix |P| 

A worst-case linear-time algorithm for acyclic queries. 
Yannakakis' classic algorithm for acyclic queries run in 
time 0(|input| + | output |). Here, we ignore the small 
log factors and dependency on the query size. Our al- 
gorithm can actually achieve this same asymptotic run- 
time in the worst-case, when we do not assume that the 
inputs are indexed before hand. See Appendix |D . 2 . 4| for 
more details. 

Enhancing NPRR. We can apply the above algorithm 
to the basic recursion structure of NPRR to speed it up 
considerably for a large class of input instances. Re- 
call that in NPRR we use AGM bound [3] to estimate 



a subproblem size, and then decide whether to solve a 
subproblem before filtering the result with an existing 
relation. The filtering step will take linear time in the 
subproblem's join result. Now, we can simply run the 
above algorithm in parallel with NPRR and get the re- 
sult of whichever finishes first. In some cases, we will be 
able to discover a very short proof, much shorter than 
the linear scan by NPRR. When the subproblems be- 
come sufficiently small, we will have an acyclic instance. 
In fact, in NPRR there is also a notion of consistent at- 
tribute ordering like in the above algorithm and the 
indices are ready-made for the above algorithm. The 
simplest example is when we join, say, R[X] and S^AT]. 
In NPRR we will have to go through each tuple in R and 
check (using a hash table or binary search) to see if the 
tuple is present in S[X]. If R = [n] and S = [2n] — [n], 
for example, then Algorithm [12] would have discovered 
that the output is empty in logn time, which is an ex- 
ponential speed up over NPRR. 

On the non-existence of "optimal" total order. A natu- 
ral question is whether there exists a total order of at- 
tributes, depending only on the query but independent 
of the data, such that if each relation's BST respects 
the total order then the optimal proof for that instance 
has the least possible number of comparisons. Unfor- 
tunately the answer is no. In Appendix [X] we present 
a sample acyclic query in which, for every total order 
there exists a family of database instances for which the 
total order is infinitely worse than another total order. 

4. FASTER JOINS WITH HIGHER DIMEN- 
SIONAL SEARCH TREES 

This section deals with a simple question raised by 
our previous results: Are there index structures that al- 
low more efficient query processing than BST for join 
processing? On some level the answer is trivially yes as 
one can precompute the output of a join (i.e., a mate- 
rialized view). However, we are asking a more refined 
question: does there exist an index structure for a single 
relation that allows improved join query performance? 

The answer is yes, and our approach has at its core a 
novel algorithm to process joins over dyadic trees. We 
also show a pair of lower bound results that allow us 
to establish the following two claims: (1) Assuming the 
well-known 3SUM conjecture, our new index is optimal 
for the bow-tie query. (2) Using a novel, unconditional 
lower bouncQ we show that no algorithm can use dyadic 
trees to perform (a generalization of) bow-tie queries up 
to poly log factors. 

4.1 The Algorithm 



4 By unconditional, we mean that our proof does not rely on 
unproven conjectures like P 7^ NP or 3SUM hardness. 
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Figure 3: Holes for the case when R = T = {2} 
and S = [1,3] x [1,3] - {(2,2)}. The two X-holes 
are the light blue boxes and the two F-holes are 
represented by the pink boxes. 



Recall the bow-tie query, Qx which is defined as: 
Qm = R(X) X S(X,Y) X T(Y). 

We assume that R and T are given to us as sorted ar- 
rays while S is given to us in a two-dimensional Binary 
Search Tree (2-D BST), that allows for efficient orthog- 
onal range searches. With these data structures, we will 
show how to efficiently compute Qxi; in particular, we 
present an algorithm that is optimal on a per-instance 
basis for any instantiation (up to poly- log factors). 

For the rest of the section we will consider the follow- 
ing alternate, equivalent representation of Qm (where 
we drop the explicit mention of the attributes and we 
think of the tables R, S and T as being input tables): 

(RxT)n S. (1) 

For notational simplicity, we will assume that |T| < 
n and \S\ < to and that the domains of X and Y are 
integers and given two integers i < r, we will denote 
the set {£,..., r} by [£, r] and the set {£ + 1, . . . , r - 1} 
by(£,r). 

We begin with a definition of a crucial concept: holes, 
which are the higher dimensional analog of the pruning 
intervals in the previous section. 

Definition 4.1. We say the ith position in R (T 
resp.) is called an X-hole (Y -hole resp.) if there is no 
(x, y) € S such that ri < x < n+i (U < y < resp.), 
where rj (tj resp.) is the value in the jth position in R 
(T resp.) Alternatively we will call the interval (r^, ri+i) 
((ti, tj+i) resp.) an X-hole (Y -hole resp.) Finally, de- 
fine hx (hy resp.) to be the total number of X-holes 
(Y -holes resp.). 

See Figure [3] for an illustration of holes for a sample 
bow-tie query. 

Our main result for this section is the following: 

Theorem 4.2. Given an instance R, S and T of the 
bow-tie query as in |7p such that R and T have size at 



most n and are sorted in an array (or 1D-BST) and S 
has size m and is represented as a 2D-BST, the output 
O can be computed in time 

O (((hx + 1) ■ (h Y + 1) + \0\) • logn ■ log 2 to) . 



We will prove Theorem |4.2| in the rest of the section 
in stages. In particular, we will present the algorithm 
specialized to sub-classes of inputs so that we can in- 
troduce all the main ideas in the proof one at a time. 

We begin with the simpler case where hy = and the 
X-holes are Z 2 , . . . , Ih x +i an d we know all this informa- 
tion up front. Note that by definition, the X-holes are 
disjoint. Let Ox be the number of leaves in T x such 
that the corresponding X values do not fall in any of the 



given X-holes. Thus, by Lemma [2.5| and Remark B.l 
with Ii — (—00,00), in time 0((h x + |Ox|)logm) we 
can iterate through the leaves in Ox- Further, for each 
x € Ox, we can output all pairs (x, y) € S (let us de- 
note this set by y x ) by traversing through all the leaves 
in T Y (v), where v is the leaf corresponding to x in T ■ 
This can be done in time 0(|34|)- Since hy — 0, it is 
easy to verify that O = i) x eo x yx- Finally, note that 
we are not exploring T Y (u) for any leaf u whose cor- 
responding x values lies in an X-hole. Overall, this 
implies that the total run time is 0((hx + |C |) log ttt.) , 
which completes the proof for the special case consid- 
ered at the beginning of the paragraph. 

For the more general case, we will use the following 
lemma: 

Lemma 4.3. Given any (x,y) £ S, in O(logn) time 
one can decide which of the following hold 

(i) x € R and y G T; or 

(ii) x g" R (and we know the corresponding hole (£ x ,r x ) ); 
or 

(Hi) y T (and we know the corresponding hole (l y , r y )). 



The proof of Lemma |4.3| as well as the rest of the 
proof of Theorem |4.2| are in the appendix. The final 
details are in Algorithm |4j 

A Better Runtime Analysis. We end this section by de- 
riving a slightly better runtime analysis of Algorithm [4] 



than Theorem 4.2 in Theorem 4.4 (proof sketch is in 
Appendix E.2|. Towards that end, let X and y de- 
note the set of X-holes and V-holes. Further, let Cy 
denote the set of intervals one obtains by removing y 
from [j/min , 2/max] ■ (We also drop any interval from Cy 
that does not contain any element from S.) Further, 
given an interval I S Cy , let £ n X denote the X-holcs 
such that there exists at least one point in S that falls 
in both £ and the X-hole. 

Theorem 4.4. Given an instance R, S and T of the 
bow-tie query as in |IJ) such that R and T have size at 
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Algorithm 4 Bow-Tie Join 



Input: 2D-BST T for S, R and T as sorted arrays 
Output: (JixT)nS 



10: 
11: 

12 
13 
14 

15 
16 

17: 

18 
19 

20 
21 
22 



O <- 

Let y m i n and j/ max be the smalles t and largest values in T 
Let (r) be the state from Lemma |E, 1| that denotes the root 
node in T 

Initialize £ be a heap with (i/mim 3/maxi (r)) with the key 
value being the first entry in the triple 
W <- 

While C ^ do 

Let (£, r, P) be the smallest triple in C 
L <- [£, r] 

While traversal on T for S with y values in L using 
Algorithm [fj] is not done do 

Update P as per Lemma |E.1| 

Let (x, y) be the pair in S corresponding to the current 
leaf node 

Run the algorithm in Lemma |4,3| on (x,y) 
If (x, y) is in Case (i) then 

Add (x, y) to O 
If (x,y) is in Case (ii) with X-hole (£ x ,r x ) then 

Compute WQ£x+l,r x -l],T x ) using Algorithm|5l 

Add W{[£ x +l,r x - 1],T X ) to W 
If (x, y) is in Case (iii) with y-hole (£y,r y ) then 

Split L = LiU(£y, r y )UL2 from smallest to largest 

L <- Li 

Add (L 2 , P) into C 
Return O 



most n and are sorted in an array (or 1D-BST) and S 
has size m and is represented as a 2D-BST, the output 
O is computed by Algorithm^ in time 



O 



I I Y, + - log™- log 2 m 



We first 
X\ < \X\ = 



note that since 
hx, Theorem 



4.2 



4.4 



\C Y \ < h Y + 1 and \£ n 
immediately implies The- 

Second, we notelhat Y,eec Y ^ n X I + \°\ - 
\S\, which then implies the following: 

Coro llar y 4.5. Algorithm^ with parameters as in 
Theorem 4-2 runs in time 0{\S\ ■ log 2 mlogn). 



It is natural to wonder whether the upper bound in 



Theorem 4.2 can be improved. Since we need to output 



O, a lower bound of f2(|C|) is immediate. In Section 4.2 



we show that this bound cannot be improved if we use 
2D-BSTs. However, it seems plausible that one might 
reduce the quadratic dependence on the number of holes 
by potentially using a better data structure to keep 
track of the intersections between different holes. Next, 
using a result of Patra§cu, we show that in the worst 



case one cannot hope to improve upon Theorem 4.2 (un- 



der a well-known assumption on the hardness of solving 
the 3SUM problem). 

We begin with the 3SUM conjecture (we note that 
this conjecture pre-dates 



statement from |17|): 



17 - we are just using the 



Conjecture 4.6 ( |17|). In the Word RAM model 
with words of size O(logn) bits, any algorithm requires 
r j2-°( 1 ) time in expectation to determine whether a set 
U C {— n 3 , . . . , n 3 } o/|Z7| = n integers contains a triple 
of distinct x, y, z £ U with x + y = z. 

Patra§cu used the above conjecture to show hardness 
of listing triangles in certain graphs. We use the later 
hardness results to prove the following in Appendix [E] 

Lemma 4.7. For infinitely many integers hx and hy 
and some constant < e < 1, if there exists an al- 
gorithm that solves every bow-tie query with hx many 
X-holes and hy many Y -holes in time 0((hx ■hy) 1 ~ e + 
\0\), then Conjecture 4-6 is false. 



Assuming Conjecture |4.6[ our algorithm has essen- 
tially optimal run-time (i.e. we match the parameters 
of Theorem |4.2| up to polylog factors). 



4.2 Optimal use of Higher Dimensional BSTs 
for Joins 

We first describe a lower bound for any algorithm 
that uses the higher dimensional BST to process joins. 

Two-dimensional case. Let V be a data structure that 
stores a set of points on the two-dimensional Euclidean 
plane. Let X and Y be the axes. A box query into T> is 
a pair consisting of an X-interval and a F-interval. The 
intervals can be open or close or infinite. For example, 
{[1,5), (2, 4]}, {[1,5], [2,4]}, and {(-oo, +oo), (-oo, 5]} 
are all valid box queries. 

The data structure T> is called a (two-dimensional) 
counting range search data structure if it can return 
the number of its points that are contained in a given 
box query. And, T> is called a (two-dimensional) range 
search data structure if it can return the set of all its 
points that are contained in a given box query. In this 
section, we are not concerned with the representation 
of the returned point set. If V is a dyadic 2D-BST, for 
example, then the returned set of points are stored in a 
collection of dyadic 2D-BSTs. 

Let S be a set of n points on the two dimensional Eu- 
clidean plane. Let X be a collection of open A"-intervals 
and y be a collection of open y-intcrvals. Then S is 
said to be covered by X and 3^ if the following holds: for 
each point (x, y) in S, x £ I x for some interval I x £ X 
or y £ I y for some interval I y £ X , or both. We prove 
the following result in the appendix. 

Lemma 4.8. Let A. be a deterministic algorithm that 
verifies whether a point set S is covered by two given 
interval sets X and y. Suppose A can only access points 
in S via box queries to a counting range search data 
structure T>. Then A has to issue U,(m\n{\X\ ■ \y\, \S\}) 
box queries to T> in the worst case. 
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The above result is for the case when I? is a counting 
range search data structure. We would like to prove an 
analogous result for the case when T> is a range search 
data structure, where each box query may return a list 
of points in the box along with the count of the number 
of those points. In this case, it is not possible to show 
that A must make n(min{|5|, \X\ ■ \y\}) box queries; 
for example, A can just make one huge box query, get 
all points in S, and visit each of them one by one. For- 
tunately, visiting the points in S takes time and our ul- 
timate objective is to bound the run time of algorithm 
A. 

Lemma 4.9. Suppose T> is a dyadic 2D-BST data struc- 
ture that can answer box queries. Further more, along 
with the set of points contained in the query, suppose 
T> also returns the count of the number of points in the 
query. Let S be the set of points in T>. Let X and y be 
two collections of disjoint X -intervals and disjoint Y - 
intervals. Let A be a deterministic algorithm verifying 
whether S is covered by X and y, and the only way A 
can access points in S is to traverse the data structure 
T>. Then, A must run in time 

n{mm{\S\,\X\-\y\}). 



Now consider the bow-tie query input, where S is 



case is a tuple (Ii , • • ■ , Id) of d intervals, one for each co- 
ordinate i £ [d]. We proceed to prove the d-dimensional 



analog of Lemmas 4.8 and 4.9 



as defined in Lemma 4.9 R (and T resp.) consists of 
the end points of the intervals in X and y. Then note 
that checking whether X and y cover S is equivalent 
to checking if the bow-tie query R{X) XI S(X,Y) IX 
T(Y) is empty or not. Thus, Lemma |4.9| shows that 
Theorem 4.2 is tight (within poly log factors) even when 
= 0. 

d-dimensional case. We generalize to d dimensions. First, 
we define the natural d dimensional version of the bow- 
tie query: 

xf =1 R i (X i )MS(X u X 2 ,...,X d ). 

It is easy to check that one can generalize Algorithm [4] 
and thus, generalize Theorem |4.2| to compute such a 
query in time 0((l\t =1 h Xi + |0|) log° {d) N). Next, we 
argue that this bound is tight if we use a d-dimensional 
BST to store S. 

For the lower bound, consider the case where we have 
a point set S in K d , and a collection of d sets Xi, i £ [d], 
where for each i the set Xi is a set of disjoint intervals. 
The point set S is said to be covered by the collection 
(Xi)f =1 if, for every point (xi,-- - , Xd) £ S, there is 
an i £ [d] for which Xi belongs to some interval in Xi. 
We define counting range search and range search data 
structures in the d-dimensional case in the same way 
as in the Two-dimensional case. A box query Q in this 



Lemma 4.10. Let A be a deterministic algorithm that 
verifies whether a point set S £ Mr is covered by a col- 
lection (Xi)f =1 of d interval sets. Suppose A can only 
access points in S via d-dimensional box queries to a 
counting range search d-dimensional data structure T>. 
Then A has to issue 



n 



box queries to T> in the worst case. 

The proof of the following lemma is straightforward 
from the proof of Lemmas 4.10| and |4.9| 



Lemma 4.11. Suppose T> is a dyadic d-dimensional- 
BST data structure that can answer d-dimensional box 
queries. Further more, along with the set of points con- 
tained in the query, suppose T> also returns the count of 
the number of points in the query. Let S be the set of 
points in T>. Let Xi, i £ [d], be a collection of d inter- 
val sets. Let A be a deterministic algorithm verifying 
whether S is covered by (Xi)f =1 , and the only way A 
can access points in S is to traverse the data structure 
T>. Then, A must run in time 



Q min < — ? 

I 2 d 



We can easily generalize the argument after Lemma |4.9| 
to conclude that Lemma [4. 1 1 | implics a tight lower bound 
(up to polylog factors) to the upper bound on evaluat- 
ing the d-dimensional bow-tie query mentioned earlier 
in the section. 

4.3 Comparison with NPRR's Worst-case Op- 
timal Algorithm 

In this section, we first present a family of database 
instances for the bow-tie query where Algorithm [4] per- 
forms exponentially better than our algorithm that runs 
on the single index as well as the NPRR algorithm. 

Example 4.1. Let n > 3 be an odd integer. De- 
fine R = T =[n]\ {[n/2\, \n/2] + 1} and S = [n] x 
{[n/2j, \n/2] + l}U{[n/2j, \n/2]+} x [n]. It is easy to 
check that the example in Figure^is the case of n = 3. 
Further, for every odd n > 3, we have hx = hy = 2 
and R XI S X T = 0. 

Before we talk about the run time of different algo- 
rithms on the instances in Examples |4.1[ we note that 
we can get instance with empty output and hx = hy = 
1 (where we replace the set {[n/2j , \n/2] + 1} by just 
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{|_ro/2j}). However, to be consistent with our example 
in Figure [3] we chose the above example. 
In the appendix, we show the following: 

Proposition 4.12. Algorithm^ takes 0(log 3 n) on 
the bow-tie instances from Example \4-l\ while both the 
NPRR algorithm and our Algorithm^ take time O(n). 

In the appendix, we show that Algorithm [4] runs in 
time at most a poly-log factor worse than both NPRR 
and the Algorithm [3] on every instance. 

Proposition 4.13. On every instance of a bow-tie 
query Algorithm^ takes at most an 0(log 3 n) factor 
time over NPRR and Algorithm^ 

5. RELATED WORK 

Many positive and negative results regarding con- 
junctive query evaluation also apply to natural join eval- 
uation. On the negative side, both problems are NP- 
hard in terms of expression complexity [I] , but are easy 



in terms of data complexity 18 . They are not fix- 
parameter tractable, modulo complexity theoretic as- 
sumptions 
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On the positive side, a large class of conjunctive queries 
(and thus natural join queries) are tractable. In par- 
ticular, the classes of acyclic queries and bounded tree- 
width queries can be evaluated efficiently 5 : 8 10 20 ,21 . 
For example, if \q\ is the query size, N is the input 
size, and Z is the output size, then Yannakakis' algo- 
rithm can evaluate acyclic natural join queries in time 
0(poly(|q|)(ATlogiV + Z)). Acyclic conjunctive queries 



can also be evaluated efficiently in the I/O model 15 



and in the RAM model even when there are inequali- 
ties 
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For general conjunctive queries, while the problem 
is intractable there are recent positive developments. A 
tight worst-case output size bound in terms of the input 
relation sizes was shown in [3] . In 14 , we presented an 
algorithm that runs in time matching the bound, and 
thus it is worst-case optimal. The leap-frog triejoin al- 



gorithm 19 is also worst-case optimal and runs fast in 



practice; it is based on the idea that we can skip un- 
matched intervals. It is not clear how the index was 
built, but we believe that it is similar to our one-index 
case where the attribute order follows a reverse elimi- 
nation order. 

The problem of finding the union and intersection of 
two sorted arrays using the fewest number of compar- 
isons is well-studied, dated back to at least Hwang and 
Lin 12 since 1972. In fact, the idea of skipping ele- 
ments using a binary-search jumping (or leap-frogging) 
strategy was already present in [12] . Demaine et al. [7j 
used the leap-frogging strategy for computing the in- 
tersection of k sorted sets. They introduced the notion 
of "proofs" to capture the intrinsic complexity of such 
a problem. Then, the idea of gaps and proof encoding 



were introduced to show that their algorithm is average 
case optimal. 

Geometric range searching data structures and bounds 
is a well-studied subject [2]{^] To the best of our knowl- 
edge the problems and lowerbounds from Lemma 4.8|to 



Lemma |4.11| are not known. In computational geome- 
try, there is a large class of problems which are as hard 
as the 3SUM problem, and thus assuming the 3SUM 
conjecture there is no o(n 2 )-algorithm to solve them jjjj. 
Our 3SUM-hardness result in this paper adds to that 
list. 

6. CONCLUSION AND FUTURE WORK 

We have described results in two directions: (1) in- 
stance optimal results for the case when all relations are 
stored in BSTs where the index keys are ordered with 
respect to a single global order that respects a REO, and 
(2) we have described higher-dimensional index struc- 
tures (than BSTs) to that enable instance-optimal join 
processing for restricted classes of queries. We showed 
our results arc optimal in the following senses: (1) As- 
suming the 3SUM conjecture, our algorithms are opti- 
mal for the bow-tie query, and (2) unconditionally, our 
algorithm to use our index is optimal (in terms of num- 
ber of probes). 

We plan future work in a handful directions. First, 
we believe it is possible to extend our results in (1) 
to acyclic queries (with non-REO ordering) and cyclic 
queries under any globally consistent ordering of the at- 
tributes. The main idea is to enumerate not just pair- 
wise comparisons (as we do for acyclic queries) but to 
enumerate all (perhaps exponentially many in the query 
size) paths through the query during our algorithm. We 
are currently working on this extension. Second, in a 
relational database it is often the case that there is a 
secondary index associated to some (or all) of the rela- 
tions. While our upper-bound results still hold in this 
setting, our lower-bound results may not: there is the 
intriguing possibility that one could combine these in- 
dexes to compute the output more efficiently than our 
current algorithms. 

We would like to point out that DLM's main results 
are not for instance optimality up to polylog factors; in- 
stead they consider average-case optimality up to con- 
stant factors. Such results are difficult to compare: it is 
a weaker notion of optimality, but results in a stronger 
bound for that weaker notion. We have preliminary 
results that indicate such results are possible for some 
join queries (using DLM's techniques). However, it is an 
open question to provide similar optimality guarantees 
even for the case of bow-tie queries over a single index. 



5 We would like to thank Kasper Green Larsen and Suresh 
Venkatasubramanian for answering many questions we had 
about range search lower bounds and pointing us toward 
several references. 
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APPENDIX 

A. ON THE NON-EXISTENCE OF THE BEST DATA-INDEPENDENT TOTAL ORDER OF 
ATTRIBUTES 

This section presents a sample query that shows that there does not exist a data-independent total order of 



attributes for the algorithm in section [3. 3| Consider the query Q 2 from Example 3.1 

R{X) cxi Sx (X, Y ) c*a S 2 ( Y, Z) txi T{Z) . 

Since X and Z play the same role, we will show that for each of the following total attribute orders there exists a 
database instance for which the order is infinitely worse than another order. 

• Bad example for the (Y, X, Z) order. Consider the following instance: 

R(X) = {2, . . . , N} 
T(Z) = {2, . . . , N} 
S!(X,Y) = {l}x[N} 
S 2 {Y,Z) = [N]x{l] 

The optimal proof for the (Y, X, Z) order needs fi(iV) inequalities to certify that the output is empty; yet the 
order (X, Z, Y) needs only 0(1) inequalities. 

• Bad example for the (X, Y, Z) order. Consider the following instance: 

R(X) = [N] 
T(Z) = [N] 
Si(X,Y) = [N]x{l} 
S 2 {Y,Z) = {2}x[N] 

The optimal proof for the (X, Y, Z) order needs f2(iV) inequalities to certify that the output is empty; yet the 
order (Y,X,Z) needs only 0(1) inequalities. 

• Bad example for the (X, Z, Y) order. Consider the following instance: 

R(X) = [N] 
T(Z) = [N] 
Si{X,Y) = [N]x{l} 
S 2 (Y,Z) = {2}x[N] 

The optimal proof for the (X, Y, Z) order needs Vt{N) inequalities to certify that the output is empty; yet the 
order (Y, X, Z) needs only O(l) inequalities. 

B. MATERIAL FOR BACKGROUND 

B.l BST Background details 



B.l.l Proof of Lemma 2. 2 



PROOF of Lemma T2.21 For notational convenience, define n = f |J7|. We first argue that \W\ < O(logn). To 
see this w.l.o.g. assume that W — [n]. Thus any node v at level < i < logn, the interval is of the form 

[j ■ n/2 l + 1, (j + l)n/2 l ] for some < j < 2*. It can be checked that any interval [£, r) can be decomposed into the 
disjoint union of at most one interval per level, which proves the claim. 

Next, consider the following algorithm for computing W. We initialize W to be the empty set and call Algorithm[5] 
with the root of T, I and r. 

It is easy to check that Algorithm [5] essentially traverses through the subtree of T with W as leaves, which by our 
earlier argument implies that there are O(logn) recursive calls to Algorithm [5j The claim on the run time of the 
algorithm follows from noting that each recursive invocation takes O(l) time. □ 
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Algorithm 5 RangeFind(u, £, r) 



Input: v € T and integers I < r 

1: If [£ v ,r v ] C [l,r] then 
2: Add v toW 
3: Return 

4: Let it and ui be the left and right children of v. 

5: [i u n] <- Kr]n[4,r u ] 

6: If ^ then 

7: RangeFind(u, l\, r{) 

8: [* 2s r a ] 4- [£,r]n[t w ,r w ] 

9: If [£ 2 ,r 2 ] ^0 then 
10: RangeFind(w, ^2,7*2) 
11: Return 



B.1.2 Proof of Lemma 2.5 



Proof. For notational convenience, define n = \U\, h = \W(I,T)\ and m = \U iy2 \. 

We begin with the case when W(l2,T) = 0, then any standard traversal algorithm which starts at the smallest 
element in Ui\2 (which under the assumption is 1\ l~l U) and ends at the largest element in U\\2 in time 0{m). 



Now consider the case when h > 0. By Lemma 2.2 we can compute W = f W(l2 1 T) in 0(log n) time. By the fact 



that we store the minimum and maximum value in T v for every vertex v, in O(h) time, for each u G W, we can store 



the interval [£ u , r u ] that we can effectively remove from U D I± that we do not have to worry about. By Remark 2.3 
we can assume that these intervals are presented in sorted order. 

The rest of the algorithm is to run the usual traversal algorithm while "jumping" over the intervals [^„,r w ] for 
every v £ W. We will assume that the standard traversal algorithm, given a node u in T, one can in O(l) time 
compute the next vertex in the order. The details are in Algorithm [6j 

Algorithm 6 JumpTraverse 

Input: BST T, h, sorted "jump" intervals [£ v ,r v ] for v € W 



1 
2 
3 
4 
5 
6 
7 
8 
9 

10 
11 



Assume the vertices in W by their sorted order are vi, . . . ,Vh- 
i <- 1. 

Let u be the left most leaf with value £ Vl . 

Let w be the leaf with the smallest value in 1\ n U. 

While The value at w is in Ii do 

x <— w 

If u = w then 

Let x be the rightmost leaf in T with value r Vi . 
i i + 1 

Let u be the left most leaf in T with value £ Vi . 
w is next leaf node after x in the traversal of T. 



We now quickly analyze the run time of Algorithm [6] First, note that the loop in Step 5 runs 0(m + h) times. 
Further, the only steps that are not constant time are Steps 3, 4, 8, 10 and 11. However, each of these steps can be 
done in O(logn) time using the fact that T is a BST. This implies that the total run time is 0((m + h) logn), as 
desired. □ 



Remark B.l. It can be checked that Algorithm^ (and hence the proof of Lemma 2.5) can be modified suitably to 



handle the case when given as input disjoint intervals I 2 , ■ ■ ■ , Ik such that Ij C 7 1 and we replace C7u 2 and W(L%, T) 
by Ii \ U^ =2 Ij and U^ =2 W(Ij,T). (Note that the unions are disjoint unions.) 



Remark B.2. We point out that Algorithm^ (or its modification in Remark B.l) do not need to know the intervals 



I2, ■ ■ ■ , Ih before the algorithm starts. In particular, as long as the traversal algorithm goes from smallest to largest 



15 



values and we can perform the check (i.e. is w the left most element in the "current" interval) in Step 7 in time at 
most T, then we do not need to the know the intervals in advance. In particular, by spending an extra factor of T 
in the run time, we can check in Step 7 if w is indeed the left most element of some interval Ij . If so, we run the 
algorithm from Lemma 2.2 to compute W(Ij,¥) and then we can run the algorithm as before (till we "hit" the next 
interval Ij>). 

B.I 3 2D BST Background 








v 3 ' ) 
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i 
i 









[3] 1 




Figure 4: The 2D-BST for the set U in Figure [T} The ranges in each node v represents the interval 
The number in blue near the node its the value n v . The BST on the x values has dotted lines 
while the BSTs on he y values have solid lines. Finally for each node v in the x part of the 2D-BST, 
the BST on the corresponding y values is pointed to by a green arrow from v. 



B.2 DLM Background 

Definition B.3. An element e is recursively defined to be eliminated in an argument P if either 

• (a < b) G P where e is a weak predecessor of a, and b has no eliminated predecessors; 

• (a < b) £ P where e is a weak successor of b, and a has no uneliminated successors. 

Lemma B.4. An argument is a -proof precisely if an entire set is eliminated. 

Proof. Notes that eliminated elements do not belong to the intersection set. So if an entire set is eliminated, 
then obviously the argument P is a 0-proof. 

Now suppose if the argument P is a 0-proof, we will show that there is one set which has all elements eliminated. 
Let's consider the intersection set problem with 2 sets A and B. 

Consider the following algorithm that will eliminate entirely one of 2 sets A and B 

While (A or B is not entirely eliminated ) do 

Denote A[i] and B[j] be the smallest uneliminated elements in A and B. 
If there exists a k >= j such that (B [k] < A[i]) in P then 

eliminate all weak predecessors of B [k] in B 
Else if there exists a k >= i such that (A[k] < B[j]) in P then 
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eliminate all weak predecessors of A[k] in A. 

End while 

Note that inside the loop, exact one of 2 conditions must occur because otherwise we can construct an instance 
satisfying P but its intersection is not empty. So the argument P is not a 0-proof. 

Also when one of 2 cases is implemented, one of two sets A and B has more elements that are eliminated. Because 
A and B are finite, the algorithm will stop and it shows that A or B is entirely eliminated. 

□ 

Definition B.5. A low-to-high ordering of an argument is an ordering with the property that each comparison 
(A s [i\ < A t [j]) newly eliminates elements just in A s , unless it entirely eliminates A s ( in which case it may newly 
eliminate elements in all sets). 

Lemma B.6. Every $-proof has a low-to-high ordering. 

Proof. Consider the 0-proof P of the intersection set problem of 2 sets A and B. 
We will construct the low-to-high ordering 0-proof P' from P as follows: 

Initialize P' is empty 

While (A or B is not entirely eliminated) do 

Denote A[i] and B[j] be the smallest uneliminated elements in A and B. 

If A[i] > B[j] then 

add all comparisons (B[k] < A[i]) in P to P' 

Else add all comparisons (A[k] < B[j]) in P to P' . 
End While 

Add all remaining comparisons in P to P'. 

Obviously P' is the low-to-high ordering 0-proof. 

□ 

C. ALGORITHMS AND PROOFS FROM SECTION 
C.l The 1BST Case 

In this section, we will consider the following problem: Given 3 relations R(X), S(X,Y), and T(Y), we want to 
compute R(X) X S(X, Y) X T(Y) which is called a bow-tie query. In this section, the relation R and T are sorted 
by X and Y, respectively. Also the relation S is sorted by X, then by Y. So we call this query bow-tie query in one 
index case. 

C.l. I Proof structure of bow-tie query in one index case 
The main idea to compute the bow-tie query is as follows: 

• First we will compute W(X,Y) = R(X) X S(X,Y). Then for every x, denote W[x\ = {(x,y)\(x,y) is a tuple 
of W}. Then it is easy to see that W is partitioned into k disjoint relations W[a;i], M / [x 2 ], . . . , Vt^Zfc]. 

• The result join of bow-tie query is the union of all W[xj] X T(Y). 

Notice that we already know how to compute the query W(X,Y) — R(X) X S(X,Y), which is a hierarchical 
query that is shown in the Hierarchical Query Section. In the following section, we will focus on how to compute 
the query (U*U W[x t ]) X T{Y). 

C.l. 2 Support for bow-tie query in one index: Union of Intersections Problem 

We consider the following problem: Given a set S and a collection of k other sets A\, A2, ■ . . , Ak that are sorted 
in the ascending order, we want to compute 

k 

R=\J(SnAi). 

1=1 
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Remark 1: Instead of only output all elements in R, for every element r in R, we also want to output all 
occurrences of r in every Ai,i = 1, . . . , k. 

We call this problem the union-intersection problem in this section. 

In the following subsections, we will show the algorithm that generates the minimum proof. Then we show that 
the minimum proof that just involves the comparisons between S and Ai should be optimal within a constant factor. 
Finally, we describe the algorithm that computes the union-intersection problem and has optimal running time 
(within a log factor). 

C.1.3 Finding the minimum proof 

To make the algorithm clean, without loss of generality, we will assume that S[l] < min{Ai[l], ^[l], . . . , ^4fc[l]}. 
Consider the following algorithm that generates the minimum proof for the union-intersection problem. 

Algorithm 7 Bow-tie Query Fewest-Comparisons 

Input: Instance of union-intersection problem R = {J^ =1 (S H Ai) 
Output: A proof P with fewest comparisons, i.e., \P\ — T>. 



1: While S is not entirely eliminated and one of Ai, i = 1, . . . , k is not entirely eliminated do 

2: Denote , ai 2 , . . . , a,i m be the minimum uneliminated elements of Ai 1 , Ai 2 , . . . , Ai m , respectively. 

3: Denote a = minjajj , . . . , dj m }. 

4: In the set S, search for the maximum element s such that s < a. 
5: If s is the last uneliminated element of S then 

6: for every j = 1, . . . , m, add the appropriate comparison (s < a«.) or (s = a^) to the proof P. 

7: Stop the algorithm. 

8: Let s' be the successor of s in S. 

9: Also suppose that a^, . . . 1 a.j v are all elements in {a^, . . . ,a; m } such that aj t < s' for every t — I,..., p. 
10: For each t = 1, . . . ,p, in the set Aj t , search the maximum element a'j t such that a' t < s' . 
11: Sort aj 1 , Qj-j , . . . , aj p , a'j p in the ascending order. Suppose that after sorted, those elements are 6i, . . . , 6 9 PJ 
12: For notational simplicity, denote bo = s and b q+ \ = s' . 
13: Add the comparison (bo < b\) to the proof P. 
14: Add the comparison (b q < b q+ i) to the proof P. 
15: For t = 1, . . . ,q- 1 do 

16: Consider two elements 6 t and we have the following cases: 

17: If there is no i in [0,t] such that 6^ and b t +\ belongs to the same set and no comparison (b < b t +i) is 

added so far, then add the comparison (b t < &t+i) to the proof P. 
18: If there is no i in [t + 1, q + 1] such that bt and bi belongs to the same set, find the smallest index 

l,t + 1 < I < q such that there is no j,j < I and bj and b[ belongs to the same set. Also no comparison (b < b[) 

is added so far. Add the comparison (b t < b{) to the proof P. If no such I is found, add the comparison (b t < s') 

to the proof P. 

19: Otherwise, nothing is to be added to the proof P. 

20: Mark all elements < s in S as eliminated. 

21: For every t = 1, . .. ,p, in the set Aj t , mark all elements < a'j t as eliminated. 



Theorem C.l. For any given instance R = {J^ = i(S C\ AA, the algorithm Fewest- Comparisons generates the proof 
for the union-intersection problem with the fewest comparisons possible. 

Proof. We show that in each case of the algorithm, it generates the proof with the fewest comparisons possible. 
Consider the line [5] of the algorithm. If s is the last uneliminated element of S, then for every j = 1, . . . ,m, then 
the fact the ai- is bigger or equal to s must be specified. Otherwise, P is no longer a valid proof for this instance. 
It is not difficult to see that m is the most optimal number in terms of number of comparisons needed to add to the 
proof P to make it valid. So line [5] is efficient. 

Consider lines [8p9] in the algorithm. In this case, for every i — 1, . . . ,q, the valid proof P must be able to show 
which of the following facts are true: hi is bigger to s, bi is equal to s, bi is smaller than s', and bi is equal to s' . 

We will prove by induction that at each step in line |17| the algorithm tries to construct the minimum proof by 
adding those comparisons. Suppose that this fact is true at step t — 1. Consider the step t. 

• Let's look at case a) in line 17 In this case, one of the comparisons (bo < b t +i), . . . , (b t < &t+i) must be added 
to the proof P. Otherwise the proof P is no longer valid because we can not determine whether 6 t+1 = s or 
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b t+ i > s. We can see that if some other minimum proof Q that decides to add (bi < b t +i) to the proof, then 
we can replace that comparison by (bt < £>t+i) and the proof Q is still valid. So in this case, choosing the 
comparison (b t < &t+i) is optimal. 



• Now consider case b) in line 18 We can see that one of the comparisons (b t < &t+i), • • • , (bt < b q +\) must 
be added to the proof P. Otherwise the proof P is no longer valid because we can not determine from P 
whether b t — s' or b t < s' . Suppose that I is found and the comparison (b t < &;) is added to the proof P. Now 
suppose that the other minimum proof Q chooses to add the comparison (b t < bh)- Obviously there exists one 
comparison (b < bi) in Q. If bh < h then by replacing (b t < bh) by (b t < 6;), the proof Q is still valid. If bh > 6/ 
then we can replace two comparisons (b t < bh) and (b < b{) in Q by (b t < b{) and (b < bh) respectively. Note 
that the proof Q is still a valid minimum proof. So at this case, choosing the comparison (bt < bi) is optimal. 
In summary, the algorithm generates the proof with the fewest comparisons. 

□ 

Denote 2? be the number of comparisons of the minimum proof that is generated from the algorithm Fewest- 
Comparisons. Then 2? is the lower bound of running time of any algorithm that computes the union-intersection 
problem. The following theorem shows the upper bound of the size of the minimum proof that contains only the 
comparisons between S and Aj, and it turns out that this proof is optimal within the constant factor. 

Theorem C.2. For any instance of union-intersection problem, the number of comparisons in minimum proof 
that contains only the comparisons between S and A^, i = 1, . . . , k is no more than 22?. 

Proof. Let P be a minimum proof and Q be a minimum proof that contains only the comparisons between S 
and Ai,i = 1, . . . , k. Then we will show that in each phase in the algorithm Fewest-Comparisons, the number of 
comparisons in Q is no more than two times the number of comparisons in P (1) 

Consider line[5]in the algorithm, obviously \P\ = \Q\, so the fact (1) holds. Consider lines 8p9 of the algorithm, 



suppose that in {ii, 12, ■ ■ ■ , i m } there are exactly to indices t such that at < a' t . So from now in this proof, we call 
t as an index that has such property and I as an index such that a; = ai>. So we have m such t and (jp — to) such 
that index I. 

For every t, 2 comparisons (s 9 a t ) and (a' t 9 s') should be in Q. Also for every I, at most 2 comparisons (s 9 ai) 
and (ai 9 s') should be in Q also (when (a/ = s) or (a; = s') then there is only one of them in Q). (Here 9 € {<, =}) 
So there are at most 2m + 2(p — to) = 2p comparisons in the proof Q. 

Now we will estimate the lower bound of the number of comparisons in P. For every t, from the proof P, it can 
be determined whether a t > s or at = s. So in P, there are at least one comparison (b 9 a t ). Because otherwise, it is 
free to set a t to be equal or greater than s and also satisfies the proof P; in other words, P is no longer valid. Also 
for every I, if a/ < s' then in P, there is at least one comparison (b 9 a{) so that we can determine whether a; > s or 
ai = s from the proof P. If a; = s' then there is at least one comparison (ai = b) so that we can verify ai — s' from 
the proof P. 

Notice that all comparisons we describe above are pairwise different. So there are at least to + (p — to) = p 
comparisons. And so, \Q\ <2p< 2\P\. 

□ 

Theorem C.3. Given an instance R = [Ji = i(Sr\Ai) of the union-intersection problem, denote N — max{\S\, \Ai\, . . . , |^4/c|}- 
Then R can be computed in time 0(T>\ogN). 

Proof. Consider the following algorithm Union-Intersection to compute R 
Algorithm 8 Union-Intersection 

Input: A set S and k sets Ai, i = 1, . . . , k sorted in the ascending order. 

Output: R = \J- =1 (S n Ai) ( all occurrences of a value in the result need to be output) 



For i = 1, . . . , k do 

Compute A • = S n Ai using the Set Intersection Algorithm 

return R = Ai U • • • U A k . 



We will show that Algorithm Union-Intersection (Algorithm [8| can compute R in 0(2? log N) time. For every 
i = 1, . . . , k, denote 2?^ be the number of comparisons of the minimum proof of the intersection problem S (1 A4. 
Then by using Set Intersection Algorithm, we can compute A\ — S n Ai in 0(T>i log A) time. 
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Also by Theorem [02| T>\ H V V k < 2V, so computing all A[ takes 0(V\ogN) time. 

By allowing duplications in R, computing R by just outputting all elements in Ai, . . . , A k takes \Ai \ +• • • + \A k \ < 

\T>i \ H h \ V k \ < 2V time. 

So R can be computed in 0(T>\og N) time. □ 

C.1.4 Bow-tie query in one index case 



Algorithm 9 Bow- Tie-One-Index 

Input: Relations R(X) and T(Y) sorted by X and Y respectively. Relation S(X, Y) is sorted by X, then by Y . 
Output: U{X,Y) = R(X) X S{X,Y) X T(F) 
1: W(X, Y) = R(X) XI S(X, Y) by using the hierarchical query algorithm. Partition W(X, Y) into k relations 
W\, . . . , Wk such that for every i = 1, . . . , k, in relation Wi, all tuples have the same X attribute and Wi are 
sorted by Y in ascending order. 
2: U(X, Y) — (Ji=i(Wi 1x1 T) using the algorithm Union-Intersection 
3: return U 



Remark 1: For every i = 1, . . . , k, the algorithm does not have to compute Wi explicitly by adding all satisfied 
tuples to Wi. Instead, it just has to compute the low index I and high index h such that Wi will contains all tuples 
S[l], . . . , S[h\. By using the hierarchical query algorithm, it can be obtained and also those tuples are sorted by Y 
attribute. 

Remark 2: Denote T>i be the number of comparisons of the minimum proof in the hierarchical join query 
problem W(X, Y) = R(X) M S(X,Y). Denote T> 2 be the number of comparisons of the minimum proof in the 
Union-Intersection problem U(X,Y) = {J i=1 (Wi ixi T). So T> = T)\ + T> 2 is the number of comparisons of the 
minimum proof in the Bow-Tie query in one index case. 

We will show that the Algorithm Bow- Tie-One- Index (Algorithm [9]) is optimal within a log N factor. 

Theorem C.4. Given an instance U(X,Y) — R(X) X S(X,Y) 1X1 T(Y) of the bow-tie query in one index 
case problem; R and T are sorted by X and Y respectively, S is sorted by X, and then by Y . Denote N = 
max{\R\, \S\, \T\}. Then U can be computed in time 0{T>\og N) time. 

Proof. By using the hierarchical join query algorithm, we can compute all Wi in 0(T)\ log TV) time. 

Also by Remark 1, Wi is sorted by Y. So T,Wi, . . . ,Wf. are satisfied to be applied in the algorithm Union- 
Intersection to compute U. By Theorem C.3 we can compute U in time 0(T>2\ogN). 

In summary, U can be computed in time 0(1? log N). □ 

C.2 An Example where Simple Modification of Algorithm [3] Performs Poorly 

Let Q 2 join the following relations: 



R(X) = [N], S 1 (X,X 1 ) = [N] x [N], S 2 {X 1: X 2 ) = {{2,2)}, and T(X 2 ) = {1,3} 

Also assume that all relations are sorted in a single index with that order. We run the DFS-Style join algorithm 
to see how it performs in this example. 

The order of attributes are X, Xi,X 2 . So for each step, we will search the output tuple by one attribute at a time. 
First, R[l] = 1 and Si[l] = [1, 1] match by attribute X. But when it considers the relation S2, it does discover that 
there is no tuple in S2 with X\ = 1. So the algorithm backtracks with the next tuple in Si, that is Si[2] = (1,2). 
Now Si [2] matches with the tuple S^l] = (2, 2). When comparing with X 2 in T, we have the following comparisons: 
S , 2[1].A2 > T[1].A2 and S 2 [l].X 2 < T[2].X 2 . By those comparisons, the algorithm knows that S^l] does not match 
any X 2 in T. So it backtracks with the next tuple in Si, that is Si [3]. And it continues with the same fashion. So 
the above DFS-style algorithm will take fl(N) steps to figure out that the result join is empty. 

On the other hand, the proof of this instance includes only 2 inequality comparisons S2[1].A2 > T[1].X2 and 
S 2 [1].A 2 < T[2].X 2 . This proof shows that S 2 1X1 T(X 2 ) is empty, and hence the Q 2 is empty But the DFS-style 
algorithm does not remember those constraints. As a result, it keeps joining among R and Si, and takes Q(N). 

D. ALGORITHMS AND PROOFS FOR SECTION S3] 

Before we present our main algorithm in Algorithm[l2j we will first present its specialization for the set intersection 
and the bow-tie query. These specializations are different from Algorithm [l] and [3] respectively. However, unlike the 
previous algorithms these specialization clearly demarcate the role of the "main" algorithm and the data structure to 
handle constraints. Further, these specializations will help illustrate the main technical details in our final algorithm. 
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D.l Computing the intersection of m sorted sets 

The purpose of this section is two-fold. First, we would like to introduce the notion of certificate (called proof in 
DLM) that is central in proving instance-optimal run-time of join algorithms. Second, by presenting our algorithm 
specialized to this case, we are able to introduce the "probing point" idea and provide a glimpse into what the "ruled 
out regions" and "constraint data structure" are. 

Consider the following problem. We want to compute the intersection of m sets Si," - ,S m . Let n, ; = \Si\. We 
assume that the sets are sorted, i.e. 

St[l] <S t [2] < ••• <Si[rii\,Vie [m]. 

The set elements belong to the same domain D, which is a totally ordered domain. Without loss of generality, we 
will assume that D = [N]. One can think of [N] as actually the index set to another data structure that stores 
the real domain values. For example, suppose the domain values are strings and there are only 3 strings this, is, 
interesting in the domain. Then, we can assume that those strings are stored in a 3-element array and N = 3 in 
this case. 

In order to formalize what any join algorithm "has to" do, DLM considers the case when the only binary operations 
that any join algorithm can do are to compare elements from the domain. Each comparison between two elements 
a, b results in a conclusion: a < b, a > b, or a = b. These binary operations are exclusively used in real-world join 
implementations. It is possible that one can exploit, say, algebraic relations between domain elements to gain more 
information about them. We are not aware of any algorithm that makes use of such relationships. 

After discovering relationships between members of the input sets, a join algorithm will have to output correctly 
the intersection. Consequently, any input that satisfies exactly the same collection of comparisons that the join 
algorithm discovered during its execution will force the algorithm to report the "same" output. Here by the "same" 
output we do not mean the actual set of domain values; rather, we mean the set of positions in the input that 
contribute to the output. For example, suppose the algorithm discovered that Si[i] = 5*2 [i] = • • • = S m [i] for all i, 
then the output is {Si[l], Si [2], ■ • • }, whether or not in terms of domain values they represent strings or doubles or 
integers. In essence, the notion of "same" output here is a type of isomorphism. 

The collection of comparisons that a join algorithm discovers is called an argument. The argument is a certificate 
(that the algorithm works correctly) if any input satisfying the certificate must have exactly the same output. More 
formally, we have the following definitions 

Definition D.l. An argument is a finite set of symbolic equalities and inequalities, or comparisons, of the 
following forms: (1) [S s [i] < St[j}) or (2) S s [i] = St[j] fori,j > 1 and s,t G [m). An instance satisfies an argument 
if all the comparisons in the argument hold for that instance. 

Arguments that define their output (up to isomorphism) are interesting to us: they are certificates for the output. 

Definition D.2. An argument P is called a certificate if any collection of input sets Si, ... , S m satisfying P must 
have the same output, up to isomorphism. The size of a certificate is the number of comparisons in it. The optimal 
certificate for an input instance is the smallest-size certificate that the instance satisfies. 

As in DLM, we use the optimal certificate size to measure the information-theoretic lower bound on the number 
of comparisons that any algorithm has to discover. Hence, if there was an algorithm that runs in linear time in the 
optimal certificate size, then that algorithm would be instance-optimal with respect to our notion of certificates. 

Our algorithm for set-intersection. Next, we describe our algorithm for this problem which runs in time linear in 



the size of any certificate. Algorithm 10 is a significant departure from the fewest- comparison algorithm in DLM (i.e. 
Algorithm [TJ . In fact, our analysis can deal directly with the subtle issue of the role of equalities in the certificate 
that DLM did not cover. We will highlight this issue in a later example. 

The constraint set and constraint data structure. The constraint set C is a collection of integer intervals of the form 
[£, h], where 0<l<h<N + l. The intervals are stored in a data structure called the constraint data structure 
such that when two intervals overlap or adjacent, they arc automatically merged. Abusing notation, we also call the 
data structure C. We give each interval one credit, 1/2 to each end of the interval. When two intervals are merged, 
say is merged with [^2,^2] to become [£i,h 2 ], we use 1/2 credit from hi and 1/2 credit from £ 2 to pay for 

the merge operation. If an interval is contained in another interval, only the larger interval is retained in the data 
structure. By maintaining the intervals in sorted order, in 0(l)-time the data structure can either return an integer 
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Algorithm 10 Computing the intersection of to sets 



Input: to sorted sets Si, • • • , S m , where \Si\ = rii, i € [to] 



1: Initialize the constraint set C <— 

2: For i = 1, . . . , m do 

3: Add the constraint [^[nj + 1, JV + 1] to C 

4: Add the constraint [0, Si[l] - 1] to C 

5: While C can still return a value t do > t G [N] 

6: For i = 1, . . . , m do 

7: Let ej = Si[xj] be the smallest value in Si such that ej > t 

8: Let = Sjfx^] be the largest value in Si such that e[<t > It is possible that ej = e\ and x$ = x\ 

9: If ej = t for all i £ [to] then > Then all e[ — t too 

10: Output t 

11: Add constraint [t,t] to C 

12: else 

13: For each i G [to] such that e< > t do > < t and x^ = Xj — 1 for such i 

14: Add constraint \e\ + 1, — 1] to C > This is similar to the 2D-BST case 



£ € [0, iV + 1] (our probing point) that does not belong to any interval, or correctly report that no such t exists. In 
other words, each query into C takes constant time. Inserting a new interval into C takes 0(log n)-amortized time 
where n is the maximum number of intervals ever inserted into C, using the credit scheme described above. 

Certificate. The optimal certificate, or any certificate for that matter, is a set of comparisons of the form Si [xj] 9 Sj [xj] , 
where 6 G {=, <, >}, i,j G [to], x, G [nj, and Xj G [rij], such that any input satisfying the certificate will have the 
same output, up to isomorphism. 



The following theorem states that Algorithm 10 has a running time with optimal data complexity (up to a log n 
factor) . 



Theorem D.3. Algorithm 10 runs in time 0(m|P| log n), where P is any certificate for the instance, and n = 

i=l n i- 

Proof. We first upper bound the number of iterations of the algorithm. The key idea is to "charge" each iteration 
of the main while loop to a pair of comparisons in the certificate P such that no comparison will ever be charged 
more than a constant number of times. Each iteration in the loop is represented by a distinct t value. Hence, we 
will find a pair of comparisons to pay for t instead of paying for the iteration itself. 

Let t be a value in an arbitrary iteration of Algorithm 10 Let ei,e£,x.; and x\ be defined as in lines [7] and [8] of 
the algorithm. 

If ej = t for all i G [to], then P has to certify this fact with at least to — 1 equalities of the form Si[xi] = Sj[xj]. (If 
we think of the <Sj[xj] as nodes of a graph and the equalities among them as edges, then we must have a connected 
graph to certify the output t.) It is easy to see that if there were no such equalities then we can slightly perturb the 
input instance to still satisfy P but t is no longer an output. In this case, we pay for t by charging any of the (at 
least) to — 1 equalities in the certificate. 

Now, suppose a > t for some i, i.e. t is not an output. (Note that by definition it follows that e[ = Si[xJ ; ] < t.) 
For each i G [to], the member ^[Xj] is said to be t-alignable if Si[xj] is already equal to t or if Si[xi] is not part of 
any comparison in the certificate P. Similarly, we define the notion of t-alignability for S^fx^], i G [to]. 

When Si[xi] is i-alignable, setting Si[xi] — t will not violate any of the comparisons in the certificate. Similarly, 
we can transform the input instance to another input instance satisfying P by setting S^xJ] = t, provided Si[x£] is 
t-alignable. 

Claim: if t is not an output, then there must exist some i G [to] for which both Si[xj] and 5^[x'] are not t-alignable. 
Otherwise, by assigning one t-alignable end in each pair Si[xj] and Si[x^] to be t, we obtain a new instance satisfying 
the certificate P but this time t is an output. 

We will pay for t using any comparison involving Sj[xj] and any comparison involving Sj[x4]. Since they are not 
t-alignable, each of them must be part of some comparison in P. Since we added the interval [ej + 1, — 1] to the 
constraint set C, in later iterations t will never hit the same interval again. Each comparison involving one element 
will be charged at most 3 times: one from below the element, one from the above the element, and perhaps one 
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when ei is output. 

We already discussed how the constraint data structure C can be implemented so that insertion takes amortized 
logarithmic time in the number of intervals inserted, and querying (for a new probing point t) takes constant time. 
Since the number of intervals inserted is at most n — 2~2i n i> eacn iteration of the algorithm takes time at most 
O(mlogn). □ 

D.2 Instance Optimal Joins with Traditional B-Trees 

In this section, we consider the case when every relation is stored as a single B-tree. The B-tree for each relation is 
assumed to be built consistently with a global attribute order. For example, if the global attribute order is Ai, ■ ■ ■ ,A n , 
and R(A2, A$, Aq) is a relation, then the B-tree for relation R has three levels, the first indexed by A2, the second 
indexed by A$, and the third indexed by Aq. 

We will start with the first non-trivial query, the bow-tie query in order to illustrate some of the core ideas in the 
algorithm, the constraint data structure, and their analyses. Then, we present the general algorithm and constraint 
data structure. Finally, if the global attribute order is the reverse elimination order for the input acyclic query, then 
algorithm is instance-optimal in terms of data complexity. Our algorithm can be turned into a worst-case linear 
time algorithm with the same guarantee as Yannakakis' classic join algorithm for acyclic queries. In Appendix [Gj 
we will give an example showing that our algorithm can be much faster than the recent leapfrog triejoin algorithm. 

D.2. 1 Our algorithm specialized to the how-tie query 

The first non-trivial query is the following which we call the bow-tie query: 

q^ = R{X)MS(X,Y)MT(Y). 

The global attribute order, without loss of generality, can be assumed to be (X, Y). The indices can have different 
ranges. For example, R[i] is the ith value in R, where i is the index and the value belongs to its own domain. 
Similarly, T[j] is the jth value in T. As for S, we will use the following notation. S[i] is the ith X-value in S, and 
S[i,j] is the jth Y- value among all tuples (x,y) G S with x = S[i]. We want to clearly distinguish between the 
indices into the relations and the values which belong to the domain [N] . The reason is that the optimal certificate, 
or any certificate for that matter, only contains indices and no value in the domain. 

Structure of a certificate. Any certificate for the bow-tie query is a set of comparisons in one of the following three 
formats: 

R[%r] S[i a ], 
S[i s ,j s ] T[j t ], 
S[i s ,j s ] 6 S\i' s ,f s ]. 

where 8 G {<,=,>} is called a comparison. 

Example D.l. To see the importance of the third comparison format, consider the following example. 

R = [n] 

S = [n] x{2i\ie [n]} 
T = {2*-l I *G [n]}. 
The following comparisons form a certificate verifying that the output is empty: 

R[i] = S[i], i G [n] 
S[i,j] = S[l,j], i,j G [n],i > 1, 

T[l] < S[l, 1] < T[2] < S[l, 2] < • • • < T[n] < S[l, n}. 

The last In— 1 inequalities verify that the sets S[l, *] andT do not intersect. There are totally n+n(n— l) + (2n— 1) = 
n 2 + 271 — 1 comparisons in the above certificate. 

On the other hand, if we do not have comparisons between tuples in S (i.e. comparisons of the kind S[i Sl j s ] 6 S[i' s ,f s ] ), 
we must at least assert for every i that the sets S[i, *] and T do not intersect, for a total of at least n(2n— 1) = 2n 2 — n 
comparisons. 

It is not true that disallowing comparisons between tuples from the same relation will only blow up the certificate 
by a constant factor. Consider the following example. 
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Example D.2. Consider the following query, which is not a bow-tie query. 

R(A,C) M S(B,C), 

where 

R = [n] x{2k\ke [n]} 
S = [n] x {2k- 1 | k g [n]} 

The join is empty, and there is a certificate of length 0(n 2 ) showing that the output is empty. The certificate consists 
of the following comparisons: 

i?[l,c] = R[a,c], for a,c G [n],a > 1 
S[l,c] = S[b,c], forb,ce[n},b>l, 

S[l,l] < #[1,1] < S[l,2] < R[l,2] <■■■< S[l,n] < R[l,n}. 

If we don't use any equality, or if we only compare tuples from different relations, any certificate will have to be of 
size fi(n 3 ) because it will have to "show" for each pair a,b that R[a, *] n S[b, *] = which takes 2n — 1 inequalities, 
for a grand total of n 2 (2n — 1) = Q(n 3 ) comparisons. 

Note that our algorithm described below does run in time 0(n 2 ) for this instance. 



The constraint data structure. Every constraint is one of the following forms: ([£, h],*), (= p, [£, h]), and (*, [£, h]). 
A tuple t = (x, y) satisfies the first form of constraint if x G [£, h], the second form if x — p and y G [£, h], and the 
third form if y G [£, h]. Each constraint can be thought of as an "interval" in the following sense. The first form of 
constraints consists of all two-dimensional (integer) points whose X-values are between £ and h. We think of this 
region as a 2D-interval (a vertical strip). Similarly, the second form of constraints is a ID-interval, and the third 
form of constraints is a 2D-interval (a horizontal strip) . 

We store these constraints using a two- level tree data structure. In the first level (corresponding to X-values), 
every branch is marked with an *, — p, or an interval [£, h]. In the second level, every branch is marked with [£, h]. 
If the first level of a branch is an interval [£, h] , then there is no second level under that branch. Intervals under the 
same parent node are merged when they overlap. We use the previous trick of giving the low end and the high end 
of an interval half a credit to pay for the merging. If the second level of a (= p)-branch covers the entire domain, 
then the (= p)-branch is turned into a *) constraint that can further be merged (at the first level). We will 

give half an extra credit to each end point of an interval under a (= p)-branch so that when the branch is turned 
into a *) branch both of the end points has half a credit as any other interval on the first level. 

Inserting a new constraint takes amortized logarithmic time, as we keep the branches sorted, and the new constraint 
might "consumes" existing intervals. In amortized logarithmic time, the data structure C has to be able to report 
a new tuple t = (x, y) that does not satisfy any of its constraints, or correctly report that no such t exist. To find 
t, we apply the following strategy: 

• We first find x such that x does not belong to any first level interval and does not match any p in the (= p)- 
branches. This value of x, if it exists, can easily be found in logarithmic time. If x is found, then we find y that 
does not belong to any second level interval on the ^-branch. If there is no y then no such t exist, the algorithm 
terminates. 

• If x from the previous step cannot be found, then the first-level intervals and the values p in the (= p)-branches 
cover the entire domain {0, • • • , N + 1}. In this case, we will have to set x to be one of the p's, and find a value 
y under the (= p)-branch that does not belong to any interval under that branch. The tuple t = (x, y) might 
still violate a (*, [£, h]) constraint. In that case, we insert the constraint (= p, [£, h]) into the tree. Then we find 
the next smallest "good" value y under the (= p)-branch again. The intervals under (= p)-branch might be 
merged, but if we give each constraint a constant number of credits, we can pay for all the merging operations. 

To summarize, insertion and querying into the above data structure takes at most logarithmic time in the amortized 
sense (over all operations performed). 



The algorithm. Algorithm II solves the bow-tie join problem. Notations are redefined here for completeness. Let 
\R\, \S\, \T\ denote the number of tuples in the corresponding relations, S[*] the set of X-values in S, S[i] the i'th 
X-value, S[i, *] the set of y's for which (S[i],y) G S, and S[i,j] the j'th Y- value in the set Figure [5] illustrates 

the choices of various parameters in the algorithm. 
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Figure 5: Illustration for Algorithm |11| 



Theorem D.4. Algorithm 11 runs in time 0(|P|logn) } where P is any certificate, and n is the input size. 

Proof. We pay for each iteration of the algorithm, represented by the tuple t, by charging a pair of comparisons 
in the certificate P. Each comparison will be charged at most O(l) times. To this end, we define a couple of terms. 

Any tuple e € S[i s ], S[ig],T[ij],T[ij]} is said to be t-alignable if either e is already equal to x or 

e is not involved in any comparison in the certificate P. If a t-alignable tuple e is not already equal to x, setting 
e — x will transform the input into another input that satisfies all comparisons in P without violating the relative 
order in the relation that e belongs. A tuple e g {S[i s , i s e ], S[i s , is h }} is t-alignable if S[ig] is t-alignable and either 
e is already equal to y or e is not part of any comparison in the certificate P. Similarly, we define t-alignability for 
a tupfeee 

Next, we describe how to pay for the tuple t. 
Case 1. Line[9]is executed. Let / be the collection of pairs (i,j) for which we can infer the relation S[i, j] — S[i s 



■hh] 

using equalities in the certificate P. In particular, (is, is' 1 ) e I- There must be at least |/| — 1 equalities for this 
inference. There must also be an equality S[i,j] = T[ij] in the certificate for some € I, otherwise setting 

= y + e for all € I will change the output; in particular, t will no longer be in the output, yet the new 
input instance still satisfies P. There are totally \I\ equalities involved. We can use one of these equalities to pay 
for t. If later on t' = (S[i\, S[i, j}) is an output for another € I, we use a different equality to pay for that 

output. In total, each equality in the above |/| equalities will be charged at most once for each output of the form 
(S[i\, S[i, j]). The constraint added in line 10 ensures that we won't have to pay for the same output t again. 



Case 2. The else part (line 11 ) is executed. We claim that one of the following five cases must hold: 



(1) both R[iji\ and R[i e R ] are not t-alignable, 

(2) both S[ig] and S[i e s ] are not t-alignable, 

(3) both S[i s ,if] and S[i^,i' s h ] are not t-alignable, 

(4) both S[ig,ig] and Sli^ig 1 ] are not t-alignable, 

(5) both T[t§,] and T[i l T ] are not are not t-alignable. 

Suppose otherwise that at least one member in each of the five pairs above is t-alignable. For example, suppose 
the following tuples are t-alignable: R[i%], S[i e s ], S[i s , i l Jr ], T[ij,]. Then, by setting R[i R ] — S[i e s ] = x and S[i s , i^} = 
T[ij] = y, we obtain another input instance satisfying all comparisons in the certificate but t is now an output. 

By definition, each tuple e that is not t-alignable must be involved in a comparison in the certificate P. Instead of 
charging a comparison, we can charge a non t-alignable tuple. If each non t-alignable tuple is charged O(l) times, 
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Algorithm 11 Evaluating the bow-tie query R(X) X S(X,Y) X T(Y). 
1: Initialize the constraint set C with the following constraints 
. ([R[\B\] + 1,N + 1],*), ([0,Ji[l]-l],*> 
. ([T[\T\} + l.JV + l],*), ([0,T[1]-1],*) 

• constraints representing all tuples strictly below the smallest tuple in S, lex order 

• constraints representing all tuples strictly above the largest tuple in S, lex order 

2: While C can still return a tuple t = (x, y) do > t G [N] x [TV] 

3: i^j <- min{i | i?[i] > x}, i R max{i | R[i] < x} 
4: i^ <— min{i | T[i] > y}, iy 4— max{i | T\i] < y} 
5: ig <— min{i | <S*[z] > x}, i4 -s— max{i | S[i] < x} 
6: i% h <- min{i | S[i s , i] > y}, if <- max{i | i] < y} 
7: if <~ min {* I S[i l s , i] > y}, if <- max{i | S[i s , i) < y}, 
8: If R[i h R ] = S[i s ] = x and S[i s , i h s h \ = T[i%] = y then 
9: Output the tuple t = (x, y) 

10: Add constraint (= x, [y,y]) to C 

11: else > Interval (a, b) is the same as [a + 1, 6 — 1]) 

12: C^Cu{((fl[iU^]),*), <(^|],5[z|]),*), <*,(T[4],r[4])>}; 

13: C^CU{(= (5[i^i«],5[4,i^])) , (= S[i s ], {S\i s ^ s %S\i%,if }))} 



then each comparison will be charged 0(l)-times. We pay for t by charging any pair of non t-alignable tuples out 
of the five pairs above. We call each of those five pairs an "interval." The pairs of the type (1), (2), and (5) are 
2D-intervals, and the pairs of the type (3), (4) are ID-intervals. 

Due to the constraints added on lines [12] and [l3j the 2D-intervals are charged at most once. Since two 2D-intervals 
might share an end point, each tuple from a 2D-interval might be charged twice. The ID-intervals are charged at 
most twice. Each tuple from a ID-interval might be charged at most four times. □ 

Thus, for bow-tie queries with a single index we get instance optimal results up to poly log factors. 

D.2.2 Algorithm and constraint data structure for general queries 

This section generalizes the above ideas to the general case when we want to compute a natural join query of the 
form 

q = Ri(Ai) M R 2 (A 2 ) M ■ • • M R m (A m ). 

To simplify notations, we will try not to use subscripts as much as possible. Let atoms((?) denote the set of relation 
symbols for query q. (In the above query, atoms(<7) = {Ri,--- ,R m }, but we will not use subscripts to indicate 
relations any more.) 

Global attribute order. Let A\, A 2 , ■ ■ ■ , A n be sequence of all attributes of q. This sequence is the global attribute 
order which the B-tree index for each input relation has to respect in the following sense. For each relation 
R G atoms (g), let arity(i?) denote the arity of R. Fix one such relation R G atoms(g) with arity(i?) = k. 
Let A(R) = (ai(R), a 2 (R), ■ ■ ■ ,ak(R)) £ [n) k be an integer tuple such that the relation R has k attributes 
A ai r R \, A a2 / R -\, ■ ■ ■ ,A ak t R \ where ai(R) < a 2 (R) < ■ ■ ■ < a^R). Similar to the previous section, R[i] denotes 
the ith j4 ai (^)-value in relation R, sorted in increasing order, R[i,j] denotes the jth A a2 r R \-\aXue among all tuples 
of R whose ^4 ai (/j)-value is R[i], and so forth. More generally let i = (ii, • • • , ij) be any j-ary positive integer vector 
with j < k, then R[i] denotes the ij 'th A a .(^)-value among all tuples whose A ag ( R yvalue is R[i\, . . . , i a ], for all 
s < j — 1. Such a tuple i is called an index tuple for relation R. 

Structure of certificates. Any certificate for the query q is a set of comparisons of the form 

R[i] e s[S\ 

where 6 G {=, <, >}, R and S are two relations in atoms(g) that shares an attribute Ak, and i and j are index tuples 



such that R[i] and S[j] are A^-values. As we have seen from Section D.2.1 it is important to allow the certificates 



to compare values from the same relation. Hence, in the definition above R and S could be the same relation! 
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The constraint data structure. The constraints and the constraint data structure becomes much more involved for 
the general query case considered here. Every constraint is of the following forms: (([N] U ° [£, r] o 

i.e. it has an interval on the attribute Ai (for some i £ [n]) with trailing *'s and a prefix that has either *'s or an 
element of [N]. The entries that are elements of [N] are of the type (= p) as those in Section D.2.1| Appendix |D.5| 
presents the constraints and the data structure in details. 



The algorithm. The algorithm is described in the details in Appendix D.3 We present here briefly the main ideas 
which were alluded to in Section [373) Initially, the constraint set C is empty. When asked, C returns a new probing 
point t which lies in the output space and which does not satisfy any existing constraints in C . By examining t, each 
input relation R of arity k inserts into C about 2 k constraints which are meant to rule out a large region containing t 
that R itself "knows" for sure there cannot be any output. Overall, there are at most m2 n new constraints added to 
C per iteration. At least one of these constraints can be "charged" to a pair of comparisons in the optimal certificate 
in such away that no comparison is charged more than 2™ times. Consequently, the total number of iterations the 
algorithm loops through is at most 0(2™|P| + |output|), where P is an optimal certificate. Appendix |d| presents the 
algorithm in full and provides a formal proof of the following result. 

Theorem D.5. The total run time of the algorithm described above is 

O (mn2 n \P\\og(\input\) + mn \og(\input\) ■ \ output\ +Tc), 

where P is any certificate, \output\ is the number of output tuples, \input\ is the input size, and Tq is the total 
time it takes the constraint data structure to answer all queries from the algorithm. The algorithm adds a total of 
0(m^ n \P\ + \output\) constraints to C and issues 0(2 n \P\ + \output\) new probing point accesses to C. 

While it is great that the number of iterations of the algorithm is essentially linear in the size of the optimal 
certificate, the overall run-time of the algorithm depends on how efficient the constraint data structure is in handling 
the constraints and answering the queries. With the certificate structure described above, the maximum size of a 
certificate can only be 0(|input| 2 ), where |input| is the input size to the join problem. From there, the following can 
be shown. 

Theorem D.6. Unless the exponential time hypothesis is wrong, no constraint data structure can process the 
constraints and the probing point accesses in time polynomial in the number of constraint inserts and probing point 
accesses. 

Proof. We prove this theorem by using the reduction from UNIQUE-CLIQUE to the natural join evaluation problem. 



The reduction is standard 16 . The unique-/c-CLIQUE input instance ensures that the output size is at most 
1. The result of Chen et aL^j implied that if there was an 0{n°^)-t\me algorithm solving unique- fc-CLiQUE, 
then the exponential time hypothesis is wrong, and many NP-complete problems have sub-exponential running 
time. Consequently, if there was a constraint data structure satisfying the stated conditions, the exponential time 
hypothesis would not hold. □ 

D.2.3 Instance Optimal Acyclic Queries with Reverse Elimination Order of Attributes 

The above negative result requires us to restrict the class of input instances which we hope to design an instance 
optimal algorithm for. Somewhat expectedly, we are able to do so for the class of acyclic queries Ch. 6.4], given 
that the global attribute order is a reverse elimination order (REO). The REO order will be defined formally in 
Appendix |D.5.2| In this case, the constraint data structure can indeed be implemented efficiently. 

Theorem D.7. For any acyclic query q with the consistent ordering of attributes being the reverse elimination 
order (REO), there exists a constraint data structure, denoted by ConstraintTree, that can perform the following 
operations: 

1. A constraint can be inserted into the ConstraintTree in amortized time 0{n\ogV), where V is total number 
of comparison vectors inserted into T overall. 

2. This ConstraintTree can return a tuple t that does not satisfy any constraint in amortized time 0(n2 3n log N) . 

The implementation of such constraint data structure and its analysis can be found in Appendix |D.5| Theorem |D.7| 
and Theorem |D.5| prove Theorem |3.7| 



D.2.4 Some consequences of our algorithm and comparisons to prior algorithms 
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Algorithm 12 Evaluate the query q = l*l_R e atoms( g ) R{A(R)) 



Initialize the constraint set C = 

For each relation R € atoms (q) do 

Add to C constraints eliminating all tuples above the highest tuple in R 
Add to C constraints eliminating all tuples below the lowest tuple in R 



While C can still return a tuple t = (ti, . . . , t n ) do 
For each R £ atoms (q) do 
k «- arity(-R) 
For p = to k — 1 do 

For each vector v £ {£, h} p do 



>t€ [N] r 



> {£, h}° contains only the empty vector 



If R 



Let (vi, 



Ah) Ah,h) 

l R ' L R, ' 



denote v, then Vj £ {£, h} 



lin | i | R 
Lax | z | R 



R 



•R 



Avi) Avi,v 2 ) 
l R > l R > 



■(v 1 



> f 



< t 



= *. 



Output the tuple t 
Add constraint (= t%,= t% 



o (ij) for all p e [arity(-R)] and for all R £ atoms(g) then 
[*„,*„]) to C 



— in 



else 



For each R £ atoms(g) do 
k arity(i?) 
For p — to k — 1 do 

For each vector v £ {£, h} p do 



> Interval (a, b) is the same as [a + 1, 6 — 1]) 



1? 



.(i>i 



1? 



Avi) 



■ (v 1 



(v,i)i 



(v,h)' 



A wori'f-cfl^e linear-time algorithm for acyclic queries. Yannakakis' classic algorithm for acyclic queries run in time 
0(|input| + |output|). Here, we ignore the small log factors and dependency on the query size. Our algorithm 
can actually achieve this same asymptotic run-time in the worst-case, when we do not assume that the inputs are 
indexed before hand. All we have to do is to first index all input relations using REO as the global attribute order, 
then run our algorithm described above. The algorithm runs in time 0(|-P| + (output |) where P is any certificate. 
It is not hard to show that in the acyclic case there is a certificate P with |P| = |fnput|. To obtain this certificate, 
we run Yannakakis' algorithm! Each time a semijoin reducer is run, some tuples are eliminated. The time it takes 
to eliminate tuples is proportional to the number of eliminated tuples. In order to eliminate a tuple, it must be 
verified that the tuple cannot be matched with any tuple on the other side of a join. This fact can be verified 
with two inequalities certifying that the tuple "fall between" two other tuples on the other side of the join. For 
uneliminated tuples, equalities can be used to certify them. The set of equalities and inequalities constructed this 
way is a certificate because if we run Yannakakis algorithm on any input satisfying these comparisons we would have 
exactly the same output. This certificate's size is proportional to the time it take to run two rounds of semijoin 
reducers, which is linear in the input size. 

D.3 Main algorithms for general queries, B-tree index case 



The algorithm. Algorithm 12 solves the natural join problem with input query q. 



Theorem D.8 (Restatement of Theorem D.5). The total run time of Algorithm 12 is 



O (mn2 n \P\ log \ input\ + mn\og(\input\) ■ \ output\ + Tc) , 

where P is any certificate, \output\ is the number of output tuples, \input\ is the input size, and Tc is the total 
time it takes the constraint data structure to answer all queries from the algorithm. The algorithm adds a total of 
0(mA n \P\ + \output\) constraints to C and issues 0(2 n \P\ + \output\) queries to C. 

Proof. First, let us count the maximum number of iterations by "paying" for the tuples t which represent each 
iteration. If t is an output tuple, then we use a "credit" from the term |output| to pay for this tuple. Hence, the 
hard part is to account for the tuples t that is not an output tuple. 
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Fix a relation R £ atoms(<7) with arity(i?) = k and a number p £ {0, 1, . . . , k — 1}. For any v £ {£, /i} p , the value 



R 



;(fl) -(«i,«2) 



' l R > 



is said to be t-alignable if all values 



7? 



■(l>l,D 2 ) 

'R ' l R ' 



.(l>l,...,l>p) 



are already t-alignable and if 



1? 



• Oi) .(vi,v 3 ) 
l R > l R > 



! 'if 



./,) 



is either equal to t a or it is not involved in any comparison in the certificate P. Similarly, we define t-alignability 

Since t is not an output tuple, there must be a relation R 



for the value R 



.(til) Avi,v 2 ) 

l R > l R > 



.(vi,...,V p ) 



,v P ,e) 



> l R 



with arity fc, some p€ {0, . . . , k— 1} and a vector v £ {£, h} p for which both R 



and i? 



•r 



> 4 i? > 'r 



21 



IS 



are not t-alignable. For this pair, the constraint added in line 

not empty because the pair of values are not equal to t a +1 ir)- And, each member of this non-t-alignable pair is 
involved in a comparison in P. We will pay for t by charging this pair of comparisons. 

Finally, we want to upper bound how many times a pair of comparisons is charged. We think of the pair of 
comparisons as an interval between 



and 



R 



R 



,-("i) Aviw) 

l R > l R. > 



.(Ul) .(wi,V2) 



' • ' l R 



,v p ) Avi,...,v p ,t 

! l R 



R i l R 



Avi,...,v p ) 



,v p ,h) 



! l R ! 'ij 

in a high dimensional space. To see the charging argument, let us consider a few simple cases. When p = 0, then 
the interval is a "band" from one hyperplane Hi to another hyperplane Hi. This band consists of all points whose 
A ai (R)-values are between -R[i^j] and R[ij{]- We call such an interval an n-dimensional interval. Due to the constraint 
added in line 21 a non-output tuple t from a later iteration cannot belong to the band anymore. However, t might 
belong to the left of Hi or the right of H 2 , in which case a new n-dimensional interval might be created that is 
charged to the comparison involving Hi or involving H 2 . Consequently, each comparison from a n-dimensional 
interval can be charged twice. 

When p — 1, the interval is an (n — l)-dimensional interval which is a band lying inside the hyperplane whose 
A ai f R \ -value is equal to R[i^^]. In this case, each comparison might be charged 4 times: one from one side of the 
hyperplane, one from the other side, and twice from the two sides inside the hyperplane itself. 



It is not hard to formally generalize the above reasoning to show that the comparison involving R 
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might be charged 2 P+1 times. Hence, the total number of iterations is 

at most 0(2 n |P| + | output |). And, the total number of constraints inserted into the data structure C is likewise at 
most 0{m4 n \P\ + | output |). □ 

D.4 Running Example 

Example D.3. Let Q 2 join the following relations: 

R(X) = [N] Si(X,Xi) = [N] x [N] 

S 2 (Xi,X 2 ) = {(2, 2), (2, 4)} T(X 2 ) = {1,3} 

In this example, the value domain of every attribute is [N]. The algorithm to compute Q 2 will run as follows: 

• First the constraint set C is empty. 

• For every relation, add constraints that will exclude all tuples below the smallest and above the highest tuple. 
Constraints from S 2 : (*,[M],*>, (*,= 2, [1, 1]), (*, = 2, [5, N]), (*,[3,N},*) 

Constraints from T; (*, *, [4, TV]) 

• Step 1: C returns t — (1, 2, 2) which does not belong to any interval. 
Constraints added from T: (*, *, [2, 2]) 

• Step 2: C returns t = (1, 2, 3) 
Constraints added from S2: (*, = 2, [3, 3]). 

After Step 2, C can not return any tuple. The algorithm terminates. 
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D.5 The constraints and constraint data structure 



D.5.1 Basic Data Structures 

We begin with a basic data structure, which we will call SortedList. Such a data structure L can store N 
numbers in sorted order with the following operations: 

1. Find(u, L) returns (b, i), where b g {0, 1} indicates whether v is in L or not. If b = 1, then i denotes the position 
of v in the sorted order otherwise it denotes the index of the value which is smaller than v in the sorted order. 

2. Get(z, L) returns the value at the z'th position in the sorted order. 

3. Insert(z, v, L) inserts the value v at the i'th position on the sorted array. 

4. Delete(z,L) deletes the value at the z'th position on the sorted array. 

5. DeleteInterval ([i?, r], L) deletes all the values in the interval [i,r] contained in the sorted array. 

Remark D.9. Even though we defined the SortedList data structure for numbers one can of course store more 
complex elements as long as there is a key value whose domain is totally ordered. 

Proposition D.10. There exists data structures that implement an SortedList L with N elements such that 
the first four operations above can be performed in time O(logiV) in the worst-case and DeleteInterval can be 
implemented in 0(\ogN) amortized time. 

Proof Sketch. If we use a BST with any self-balancing structure like Red-Black trees, then the claim on the 
first four operations follow immediately. For the claim on DeleteInterval note that this implies figuring out the 
index i' of the smallest £' > £ and the index f of the largest r' < r in L. This can be done with a combination 
of Find and Get operations and then performing f — i' + 11 many Delete operations. There might be many of 
these Delete operations but since each of them can be amortized against an Insert operation, we have the claimed 
O(log-ZV) amortized runtime for DeleteInterval. □ 

Next we define the functionality of another data structure that we will use, which we call IntervalList. This 
data structures stores N intervals [£i, n] for i € [N] (which can be overlapping) and supports the following operations 
(where I is the IntervalList): 

• Next(u,7) returns the smallest integer v' such that (i) v < v' and (ii) v' £ [£i,ri] for every i € [N]. 

• CONTAINS ( [£, r], I) returns whether there is an interval that contains [£,r]. 

• Insert([£, r], I) inserts the interval [£,r] into /. 

• Delete([^, r], I) deletes the interval [£,r] from / (we do not insist on [£,r] being "covered" any some interval 
ml). 

Next we show that one can use SortedList to construct an IntervalList with the above operations taking 
logarithmic amortized cost. 

Proposition D.ll. There exists an implementation of an IntervalList on N intervals that implements the 
Next and contains operations in 0(\ogN) worst-case time and the Insert and Delete operations in 0(\ogN) 
amortized cost. 

Proof. The main idea is to store the N intervals as disjoint intervals. The end points are then stored in a 
SortedList and then we use the various operations of an SortedList to implement the operations of IntervalList. 
Details follow. 

At any point of time we maintain m < N disjoint intervals [sj, tj] (i € [to]) such that 

U? =1 [i i ,r i ] = U5L 1 [a i ,t J -]. 

We then store the numbers si < t\ < S2 < < t m in a SortedList / (with an extra bit saying whether the 
number of the left or right end point of an interval) . Next we show how we implement the three operations need on 
I. 

We begin with Next(u, /). Let (b, i) Find(u, I) and u = Get(i,J). If u is the right end point of an interval 
then return max(it + 1, v), else return Get(z + 1, 1) + 1. It is easy to check that this implements the Next operation 



correctly and by Proposition D.10 all this can be implemented in O(logTO) — O(logTV) time. 

Next, we consider CONTAINS ( [£, r],I). The idea is simple: is we just if both £ and r are contained in the same 
interval. Let (bi,i) -s— Find(£,I), (u t <- Get(i,I), (bj,j) <- FlND(r, I) and Uj <- Get(j',7). Return true if and 
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only if i = j and Uj is the left end point of an interval. It is easy to check that Proposition |D. 10] implies this takes 
O(logiV) time. 

We begin with the Delete([^, r], I) operation. The main idea is to figure out the positions i' < j' of the elements 
in I that belong of intervals that are contained in [£, r) and then delete those positrons from I. We now present 
the details. Let us consider the "left" end first. Let (bi,i) <— Find(^, /) and Ui <— Get(i,7). If m is the left end 
point of an interval, then we perform Insert (i + 1,1 — 1,1) (with £ — 1 as the right end point of the new interval 
[ui,£ — 1]) and set i 1 <— i + 2. If Ui is the right end point of an interval then we set i' <— i + 1. Now we consider 
the "right" end. Let {bj,j) <— FlND(r, 7) and Uj <— Get(_j, I). If Uj is the left end point of an interval then we 
perform Insert(j + 1, r + 1,1) (with r + 1 being the left end point moving from Uj to r + 1) and set j' «— j; else we 
set f 4— j. Finally, we just have to delete all the positions from i' to j' in I. This can be implemented by calling 



Delete^', L) j' — i' + l many times. To analyze the run time, note that by Proposition D.10 this algorithm above 
can be implemented in time 0((j' — i' + l) logiV). However, note that each of the j' — i' + l positions that are 
deleted must have been added at some earlier point, which leads to an overall 0(log N) amortized runtime. 

Finally, we consider the Insert ([£, r), I) operation, which works pretty similarly to the Delete operation. To 
do this, we first delete all the intervals complete contained inside [I, r] and then add the interval (we might have to 
adjust the end points). We again start from the "left" end. Let (bi, i) 4— Find(£, I) and Ui <— Get(«, I). If Ui is a left 
end point of an interval then define £' <— Ui else £' 4- £. Now let us consider the "right" end. Let (bj,j) 4- FiND(r, I) 
and uj 4- Get(j, I). If Uj is the left end point of of an interval then define r' 4- Get(j + 1,1) else r' 4- r. Then 
run DeleteQ^', r'], I). Next, compute (b, i) Find(£',I) and then run Insert(i + 1,1' ,1) (as a left end point) 
followed by Insert(« + 2, r' , I) (as a right end point). By an argument similar to the Delete case, it can be seen 
that this algorithm runs in amortized time O(logiV) (and it is easy to see that it correctly implements the Insert 
operation) . □ 

D.5.2 Reverse Elimination Order 

We begin with the definition of the reverse elimination order (henceforth, REO). Given a join query q =lxl™ i Ri, 
we call an attribute Aj to be private to Ri if Aj only occurs in Ri. Now consider the following algorithm (which is 
a minor variation on the well-known GYO algorithm): 

• Repeat the following till q is empty: 

1. For every R C R' , remove R. Let the resulting query be q' . 

2. Output all the private attributes (in an arbitrary order) in q' . 

3. Remove all the private attributes in q' and rename it q. 

The REO of q is an ordering of the attributes that is the reverse of the ordering of the attributes output by the 
GYO algorithm above. 

We will make use of the following simple property of an acyclic query when the attributes in the relations are 
consistent with an REO. 

Lemma D.12. Let R and S be two relations in the acyclic query q under REO. If C is a common attribute of 
R and S, then the following holds. Let An (and Bg respectively) be the sequence of attributes (under the REO) of 
attributes in R (S resp.) till (but not including C ). Then either An C B$ or B$ Q Ar. 

Proof. For the sake of contradiction, let us assume that there two distinct attributes A and B such that A E 
Ar \ Bs and B € B$ \ Ar. Note that by the definition of Ar and Sr, both A and B come before C in the 
REO. However, in the GYO algorithm, it is easy to check that at least one of A or B become private attributes 
before C becomes a private attribute and hence, at least one of A or B has to come after C in the REO, which is a 
contradiction. □ 

D.5.3 Constraint Tree 

In this section, we present our main data structure, which we call a ConstraintTree, which we will use to 
efficiently keep track of all the constraints generated by the various comparisons generated by our probing certificate. 
Recall that n is the total number of attributes in the join query. 

A ConstraintTree T must support the following two operations: 

• Insert(u, i, c, T), where v is a "node" in T at "level" i that corresponds to a prefix vector c of length n — i 
such that every component is cither an interval or in [N] U {*}. that has to be inserted into the subtree of T 
rooted at v. The comparison c has a special structure: it has exactly one interval (after which all the positions 
are *- for brevity these trailing *'s might be omitted). 
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lub(x, T) returns the smallest y > x such that the satisfy all the constraints represented by T. 



Remark D.13. We note that Algorithm 12 should call t 



LUB(t, C) in Step^ Further, we do not need to satisfy 
However if we repeatedly use lub in Algorithm\12\ it will produce the 



the constraints on y in the definiton above 
output tuples in lexicographic order, which is a good property to have. 

D.5.4 The Data Structure of Constraint Tree 
In this section we spell out how we implement a ConstraintTree. 

A ConstraintTree is a tree with n levels, one for each of the attribute with the root being the first attribute 
in the global ordering of attributes, the next level corresponds to the next attribute in the ordering and so on. Each 
node v in T has three data structures: 

• CHILD,, is a SortedList with one entry per child of v in the underlying tree. Each entry in SortedList is 
labeled with an element of [N] U {*} and has a pointer to the subtree rooted at the corresponding child. There 
are two constraints: if v is a lead then it does not have this SortedList and second, exactly one label in child,, 
is *. Given a valid label x £ [N] U {*} we will use CfflLD„(x) to denote the pointer to the corresponding subtree. 

• PARENt(ii) points to the "parent" of v, which is implied by the CHILD map above. 

• interval,„ is a IntervalList of intervals. (These are intervals that have been "ruled" out by comparisons. 
We mention two invariant that we always want to maintain about ConstraintTree: 

1. For every node v in a ConstraintTree, we make sure that none of the labels in CHILD„ is contained in an 
interval in interval,,. 

2. For any path from the root to a node v, the semantics of any interval [£,r] £ interval,, is the following. Let 
v be at level i and let c <G ([N] U be the string of labels in the path. Then any vector x £ [N] n that 
satisfies the following properties: (i) for every 1 < j < i, if Cj ^ *, then Xj = Cj and (ii) x\ £ [£, r] has been 
"eliminated" by the set of comparison stored in the tree. 

D.5.5 Insertion 



We present the algorithm to insert a comparison vector into a ConstraintTree in Algorithm 13 
Next we argue its correctness and talk about its runtime. 

Lemma D.14. Algorithm \13\ correctly inserts the comparison vector c into the ConstraintTree T in amortized 
time 0(n\ogN), where N is total number of comparison vectors inserted into T overall. 

Proof. Recall that we are only dealing with comparison vectors c which arc of the form ([N] U ° [I, r\ o 

{*} n ~ 1 i i- e - it has an interval for the ith attribute (for some i £ [n]) with trailing *'s and a prefix that has either 



's or an element of [N]. Algorithm 13 uses the prefix to go down the corresponding path in T and then insert the 



interval at the appropriate interval. Note that by the invariant mentioned in Section D.5.4 such an algorithm is 
correct. Thus, to prove the correctness of Algorithm fl3| we argue that the algorithm does indeed follow the outline 
above. Steps [3] and 11 allow the algorithm to go down the corresponding path in T for the prefix of size i — 1 and 
Step 13 inserts the interval [£, r] into the appropriate location. (Step 14 then removes all children of v that have 
labels in [£, r] since they have now been eliminated.) The rest of the algorithm takes care of the remaining "border" 
cases. First we note that if a value Cj ^ * is in interval^ for some node w at level j < i — 1, then it means the 
current comparison is subsumed by existing constraints and Steps [5] and [6] do this check. Further, if at node w there 
is no child corresponding to the value Ci (assuming of course that Ci is not contained in interval^, , which we have 
already checked for), then we need to create the child node. This is achieved by Step 10 Finally, we have to take 



care of the case that the current set of constraints in T rule out all possible tuples: this is achieved by Step [15] and 
subsequent steps. 

Propositions D.10 and D.ll imply that every step (except Steps[3]and 11 1 in Algorithm 13 can be implemented in 
(amortized) time 0(log N) Q In other words, we perform 0(\ogN) amortized work in each recursive call. Further, 
it is each to check that Algorithm [13] traces a path in T and each recursive call makes the algorithm go to the next 
level. This implies that there are at most n recursive calls, which with the above argument implies the claimed 
runtime. □ 



7 We note that each comparison vector adds either one entry to the child SortedList or one interval to the interval 
IntervalList to any node in T. Since there are N comp aris on ve ctors in total, each child„ and interval,, for any node v 
has at most O(N) elements, which by Propositions D. 101 and D.ll implies the claimed O(logJV) time. 
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Algorithm 13 InsertTree(w, c, i, T) 

Input: A ConstraintTree T, a node v in T (at level i) and a comparison vector c = (ci, . . . , c n ). 
Output: Update the data structure with c. Output full if the subtree rooted at v has eliminated all possible 
tuples 



1: u <— null > Flag used to check for a full 

2: If Ci is * then 

3: U <- lNSERTTREE(CHILD t ,(*), C, i + l,T) 

4: If Ci is Ui then 

5: If CONTAINS ( [ttj, Ui], INTERVAL^) then 

6: Return null > c is subsumed by an exiting constraint 

7: else 

8: (6, w) <— FiND(«j, CHILD„) > Checking is u, is a valid child label 

9: If then 

10: Insert(u> + 1, Ui, child„) > Create a new node for child with label Ui 

11: U <- lNSERTTREE(CHILD„(u i ), C, i + 1, T) 

12: If a is [£, r] then 

13: Insert([£, r] , interval^ ) > Insert the interval into interval list 

14: DeleteInterval([£, r], child„) > Update the child„ 

15: If u is full then > All tuples in the child subtree have been exhausted 

16: If Ci is * then 

17: Return full > This node has also exhausted all possibilities. 

18: else If Ci is Ui then 

19: lNSERT([Mj, Wj], INTERVAL,,). 

20: If CONTAINS([0, N], INTERVAL,,) then 

21: Return full 

22: else 

23: Return NULL 



D.5.6 Lub : New Definitions 
We start with some definitions. We think of [—] branches at the left and [*] branches at the right. 

• A node s at level i in the tree can be identified by a tuple of length i where each component of the tuple is one 
of two types: (1) in the format [= Ci] where Cj G [N], called an equality, or [*], called a star. We call this tuple 
the path to s and denote the tuple by P(s). 

• A specialization of a node s is a node v in the tree such that the path to v can be obtained from the path to s by 
substituting [*] values that occur in P(s) with equalities. We require that there is at least one such substitution 
(so that u is not a specialization of itself). Said another way, v's path {P(v)) satisfies (1) if P(v) = [*] then 
P(s)i = [*] and (2) if P(s)i is [= c] then P(u) i is [= c]. 

• A node s is a specialization wrt x of v if for every extra equality in s (as compared to v), say at attribute Ai is 
of the form = Xi. 

• An interval I in the tree at height n — h can be described by a tuple (y, [I, u]) where v is a node at height n — h 
and I < u are in [N]. 

• We say that an element x € [N] n intersects an interval (v, [I, u]) at level i if (x l5 . . . , a;j_i) matches P(v) and 
Xi G [l,u\. 

• Given intervals / = (s, [l,u]) and J = (v, [l',u']), we say that J is to the right of / if whenever P(s)i = * then 
P{ v )i — * an d P(v) contains strictly more stars than P(s). 

• It follows immediately that if v is a specialization of s then s is strictly to the right of v. 

D.5.7 Lub : The Algorithm 

We present our algorithm to compute the lub of a given frontier vector x w.r.t. a ConstraintTree T in 
Algorithm [M] 
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Algorithm 14 LUB(x,u,i, T) 



Input: A ConstraintTree T, vector x G [N] n and u is a node in T at level i G [n]. 
Output: A lub of x w.r.t. T. 



2 
3 
4 

5: 
6: 
7 

8: 

9: 
10: 
11: 
12: 
13: 
14: 



<— NEXT(x,;, INTERVAL^). 

If x' > x.- L then 

% <- ^ for i = 1,...,£- 1 
Vi <~ x' 

ijk for fc = i + 1, . . . , n. 
For w a specialization of it wrt x do 
InsertFull(w, [xi,x' — l],i,T) 

x <- (yi, ■ • .,j/ n ) 
While true do 

y <- lub(x, CHILD„(*), i + 1, T). 
y' <- LUB(y,CHlLD u (y. i ),i + 1,T). 
If x = y' then 

Return x 
x <- y' 



> a;, lies in an interval 
> Advance the frontier at i from Xi to x' 

> We identify u as the killer of [xi, x^] in v 

> We have ensured x < y < y'. 
o Some branch advanced (conflict). Check again. 



Algorithm 15 InsertFull(«, [I, u],i,T) 



Input: A ConstraintTree T, v is a node in T and an interval [l,u). 
Output: Insert [l,u] and handle if its full 



1: InsertTree(i;, [I, u] o {*}"-% i, T) 

murdered as above) 
2: If CONTAINS ([0, N], INTERVAL,,) then 
3: HandleFull(v,T) 



> For whatever portion is inserted, that part is considered dead (and 



Algorithm 16 HandleFull(i>, T) 



Input: A i! is a node in a tree T. 
Output: A Tree Updated with v 



If v is the root then 
Return 

p parent(u). 
If v is a [*] branch of p then 
HandleFull(p, T) 

If v is reached via the [= c] branch of p then 
InsertFull(p, [c, c],i — 1, T). 



> Done! The algorithm is finished. 



> p is now full too! 



We will argue the following: 

Theorem D.15. For acyclic queries with the consistent ordering of attributes being the REO, Algorithm \14\ cor- 
rectly computes the lub o/x w.r.t. T in amortized time 0(n2 3n log N)r\ 

Proof. We defer the runtime analysis to Section |D.5.8| 

The argument for the correctness is fairly straightforward. We present a sketch for the sake of completeness. If 
Xi falls in interval,, , we must advance x at level i to beyond the right endpoint of the corresponding interval. This 
is exactly what Steps [2][5] do. Step [7] just broadcasts the knowledge of the interval just discovered to the relevant 
nodes in T. (The importance of this step comes to the fore in the runtime analysis in Section D.5.81. 

If Xi does not fall into any interval then we need to explore both the corresponding [*] branch and the [=] branch. 
This is exactly what Steps 9pT do. In particular, in Step 10 we explore the [*] branch and in Step 11 we explore 

8 Here the amortization is over both all the calls to lub over the entire run of Algorithm 12 as well as all the intervals inserted 
by Algorithm 12 because of comparisons. 
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the [=] branch. (Again this ordering will be useful in our runtime analysis.) If we make no progress in either branch, 
we return in Step [13] else we continue the explorations by updating the frontier in Step |14| 



We are done except we need to prove the correctness of Algorithm 15 and 16 The argument for correctness of 



Algorithm [16] is exactly the same argument we used to handle fulls in proof of correctness of Algorithm U3\ and is 



omitted. Given the correctness of Algorithm 16 and 13 the correctness of Algorithm 15 follows immediately. 

To complete the proof, we note that whenever the algorithm advances the frontier, it does so by the minimum 
possible amount and hence, it computes the LUB correctly. □ 

D. 5. 8 Runtime Analysis of Algorithm [74] 
We begin with a set of crucial definitions. 

Definition D.16. An InsertFull operation is triggered by an interval in the Next call. An interval I is dead, 
if it is the result of an InsertFull operation, and I does not overlap with any live interval in that corresponding 
interval list. 

More precisely, let us consider the case when [£,r] is added to a node v due to an InsertFull at node v. We 
consider two cases: 

• [£,r] intersects with [£i,ri] in interval,; on the left and this is the only intersection with any interval in 
interval^, [ri + 1,7"] is the new dead interval (which is killed by the interval that triggered the InsertFull/ 
If[£\,ri] was originally live, then [£i,ri] remains live. 

• [£,r] intersects with live intervals [£i,r\\ on the left and [£2^2] on the right. Then we replace the three intervals 
by [£i, r 2], which is now considered live. 

• All original intervals in the ConstraintTree T are live. 

• One can convert any live interval into a cashed-in interval. (We will come back to this when we prove Proposi- 



tion 1X22.) 



We make the following observation about Algorithm [12] Every interval that is inserted into T has the following 
property. If the comparison vector inserts the interval [l,r] at node v for attribute C, then there exists a relation 
R such that the C is an attribute of R. Further, the prefix of the attribute set of R before C exactly matches the 
sequence of attributes corresponding to the equalities in P(v). For notational convenience, define this prefix to be 
pref(tj, C). Further, let the relation R be denoted by REl(tj). 

Now note that by the above description of which intervals are live and dead, we can only have live intervals at 



nodes v where Algorithm 12 inserts a interval. We finally record a property that will be crucial later: 



Lemma D.17. If there exists two distinct nodes u and v in T such that both have a live interval for the same 
attribute then either u is a specialization of v or v is a specialization of u. 

Proof. We first note that since both u ^ v have a live interval for the same attribute (let's call it C), then 



rel(w) 7^ rel(k). Further, we note that u is a specialization of v iff PREf(u, C) C pref(w, C). Finally, Lemma D.12 
completes the proof. □ 



Killing Mappings. We construct a map $(/, z) — J, called the killing map, that is implicitly constructed in Algo- 



rithm 14 The killing map takes an interval / and a scalar value z contained in that interval i.e. / = (s, [I, u]) and 
z 6 [l,u] to another interval J — (v, [l',u']) such that z G [Z',t/]-and v is strictly to the right of s. 

We say that J killed (I,z). We want to keep track at a fine granularity of who killed each part of an interval 
(ping-pong-ing may create a larger interval that is to blame) . Informally, this mapping is described in the comments 
of Algorithm [14] 

Proposition D.18. Killing mappings exist and are well-defined. 

Proof. For every dead interval, we claim killing mappings exist. 

By inspection of the algorithm, the frontier is only advanced for two reasons: 

1. an interval in the current list moves it forward as of the result to a call to Next in Step[l] In this case, there 
is a single interval responsible for this move, say J and we say that J is responsible for all the values in the 
frontier for X{ <E J. In this case, all intervals inserted by in Step [7] of Algorithm [14] are killed by J. 

2. as a result of a list filling up, in which case HandleFull takes care of this. We will handle fulls separately 
later in our argument. 
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□ 

Definition D.19. We say J recursively killed (I,z) if either <!>(/, z) = J or <f>(/, z) — J' where J recursively 
killed (J',z). 

Proposition D.20. Suppose we encounter an lub computation a node v and y < y' (as described in Algo- 



rithm H). Then, for every element in q £ [y, y'], there must exist an interval J under an equal branch of v that is 



live and intersects q. 

Proof. Suppose not, let I — (s, [I, u]) at level h be the smallest interval that advances the LUB on the [=] branch 
(the y' probe). Such an interval must exist since the lub is advanced by the [=] branch (and y < y'). Moreover, 
there must be some element z £ [I, u] that is larger than y, i.e. substituting the starsr] Let z be the smallest 
such element. Such an interval and z must exist (otherwise the interval / would contain all the information and no 
advancement of the lub would be needed). If this interval is alive, we are done. So we may assume without loss of 
generality that / is dead. 

Since the killing map is defined, every J that recursively killed (/, z) and is under the [=] branch of v must also 
be dead (otherwise we are done, as we've found the required live interval). However, eventually (after at most n 
steps) the action continually moves right in the tree until we arrive at a J that is under the [*] branch of v and 
killed some interval under the [=] branch. However, this implies a contradiction: we show that the lub on [*] branch 
should have moved past y (where it was claimed to have ended): in particular let q = y except that q^ = z. Then 
q > y but q is ruled out by the interval J, so the lub on the [*] branch should have moved past it, a contradiction. 
In fact, given the above and the fact that the interval J that killed / is always a superset of the latter, we can do 
the argument till z — u above. In other words, if the frontier advances from y to y 1 due to I above, we have take 
care of all q £ [y, y 1 ). 

Now consider the the next time the frontier in the [—] branch advances from y 1 to y 2 . Note that in this case, this 
advancement has to happen to the left of s. Thus, we can do the same argument as above to match every q £ [y 1 , y 2 ) 
to a live interval in the [=] branch of some descendant of s (and hence trivially under the [—] branch of s). We can 
inductively continue this argument to finish the proof of the claim. □ 

Remark D.21. It might be useful for the later proofs to remember that when y' advances in the [—] branch, it 



might advance the different attributes at different calls to LUB. Proposition D.20 implies that for every advancement 



ofy', no matter for which attribute, there is a live interval in the [=] branch at the same level. In particular, for the 



current y' and every attribute i, there is a corresponding live interval by Proposition D.20 at level i (in T) under 
the [=] branch of some v. 

Informally, the first pass through the while loop to find a live interval is free (nothing has to get killed) . However, 
for our charging argument to work, if we repeat the while loop at a particular level and advance, then we need to 
argue that some live interval will be cashed-in. In turn, we use this to bound the total number of such conflicts. 

We use the notation y(v,t) (y'(v,t) resp.) to indicate the value of y (y' resp.) after at a node v after t iterations 
of the while loop. We introduce the notion of cashing in to analyze the algorithm. A live interval may be cashed-in. 
We will maintain that a live interval [l,u] will immediately precede a dead interval, i.e., the smaller cndpoint of the 
dead interval will be u + 1. Intervals will be cashed-in with each conflict. 

Proposition D.22. Suppose there is a conflict at iteration t + 1 at node v in its while loop that is y'(v,t) < 
y'(v,t+l)-then at least one live interval is cashed-in the [=] branch ofv during the lub computation for y' . Moreover, 
no interval is cashed-in twice. 

Proof. Let y' = y'(v, t), that is the final state of y' before we "cross" over to the [*] branch of v in the (t + l)th 
iteration of the while loop. Now, since there is a conflict, y advances under the [*] branch. Let us assume that the 



first advancement happens for attribute i. Thus, Proposition D.20 (and Remark D.21) implies that there exists a 
live interval J that corresponds to the last advancement of y' in the lub algorithm at level i. (Let w denote this 
node where y' advances.) Now, consider the first time y advances in the [*] branch. (Recall that this happens at 
level i.) 

Let this happen at node u. We first claim that the interval that advances y at u has to be live. The argument is 



similar to that used in Proposition D.20 If y advances due to a dead interval then we move right from u using the 



killing map. Note that we can keep on doing this till we cither hit a live interval (and this is the interval that should 
9 Define = d t if c t is = d; and qi = yi if di = * for i = 1, . . . , h — 1, qu = z and qj — for j — h + 1, . . . ,n. 
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have advanced y already). If we do not hit only dead intervals then ultimately, we will at a node a such that P(a) 
only has *s. Since a is not a specialization of any node, it is easy to check a can only have live intervals. Thus, we 
can assume that u has a live interval. 

Now since u and w both have live intervals, Lemma |D. 1T| implies that w has to be a specialization of u. 

Due to Step [7] in Algorithm 14 will cause several InsertFulls- in particular, in all nodes at specializations at 



level i to the left of u in the [*] branch. In particular, this inserts a dead interval that is an immediate successor of 
J in w, since w is a specialization of u. We will cache in J below (since J is both live and is immediately followed 
by a dead interval). This shows that a valid cashed-in interval exists to be charged. 

To show that an interval is cashed-in once. We assume, for contradiction, that some interval I = (s, [l,u]) (at the 
ith level) is cashed in-twice. However, this implies that the interval I (after being cashed-in) advances the lub for 
some y', and since it will be cashed-in again, y[ = u+ 1. However, since the cashed-in interval is contiguous with a 
dead interval that starts at u + 1, the y[ would advance to at least u + 2 the "second time" around. This contradicts 
that / would be cashed-in again. Thus, any live interval is cashed-in at most once. □ 

In turn, this implies that the number of conflicting runs through any given while loop at a particular height is 
bound by the number of original live intervals, which is N . Thus, the overall number of conflicting operations is 
bound by nN. 

Moreover, we observe that this implies we will run through each loop at most twice when there is no conflict (per- 
haps we will advance on the [—] branch and then have no conflict with the [*] branch which is two lub calculations). 

Actual Running Time. We start the run time analysis by assuming that there are no conflicts and there are no fulls 
generated. 

Conflict-Free and FVhh-free Case. We first observe that if we the lub advances without conflicts this implies that 
in every while loop at every node v either we advance from the [*] loop or the [=] node but not both. Using the 
above that we will invoke at most 4 lub calls at each level, our running time is 4™ = 2 2n invocations. Each such 
invocation could spawn at most 2™ insert statements. Thus our running time is 0(2 3 ™ logiV). We stress that we did 
not try to optimize this bound. 

Next, we remove the assumption of no conflicts. 



Conflict and FVLL-free Case. If there is a conflict, then by Proposition D.22 there is a unique live interval that we 
cash-in. Since the number of unique live intervals is upper bounded by the original number of intervals, number of 
conflicts is bound by nN as argued above. This yields a running time of 

0(2 3n logN(l + nN)) = 0(n2 3n N\ogN), 

since the time between any two conflict is bounded by the runtime bound in the case above. 
Finally, we consider the general case. 

Conflict and full Case. We start with the two main observations that complete the runtime analysis for the general 
case: 

1. By looking at the statement of InsertTree algorithm, note that whenever a HandleFull is called at node 
v, we delete the subtree of T rooted at an ancestor u of v (where u — v is allowed) and we insert a new interval 
in parent (it). 

2. Proposition |D . 20] implies the following: for any node v, if there is a conflict in the subtree rooted at v, then the 
live interval to which we charge this conflict also lies in the same sub-tree. 

The above implies that we can do the run time analysis as follows: we divide the run of the algorithm into phases, 
for between two fulls. Note each phase can have conflicts but cannot have a full, i.e. it falls in the previous case. 
Further, (and more importantly), intervals that are in a subtree that is dropped never gets used again in the run of 
the algorithm. Thus, in some sense after one full, in the next phase it is as if the algorithm was always working 
with the new ConstraintTree. The catch of course is that we need to somehow "pay" for the new interval that 
is inserted by HandleFull. However, note that modulo this payment we can just run the analysis above over all 
phases and get the same run time. 

Finally, we discuss how we pay for the new intervals inserted by HandleFull. Note that we can only insert a 
new interval via HandleFull if there is at least one original interval in the subtree that is thrown away in between 
two phases. In other words, the total number of fulls that we can have at any level is at most N. Thus, over all 
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levels, we might end up with nN fulls. (This is because we could potentially "charge" the same original interval 
for n fulls- one for each level that HandleFull algorithm "climbs" in T.) Now, since each such interval might 
be charged one conflict, we will need to add overall to the run time the same bound as we did in the previous case, 
which implies that we still have the same asymptotic run time of 0(n2 3n iV log N). 

E. MISSING PROOFS FROM SECTION |4J] 

E.l Proof of Theorem 1331 

Proof of Lemma 14.31 Given (x,y), first run binary search on R. If x g" R, then we know the largest number 
T{ < x. Then set i x = and r x — 7-j+i to obtain the corresponding X-hole. Otherwise x G R and we run binary 
search on T with y. If y T, then we can compute the corresponding Y-hole as above. Otherwise we know case (i) 
is true. Since we only run constant many binary searches, the time bound follows. □ 



We now finish the proof of Theorem |4.2| Next, we again consider another special case: in particular, let us 
assume that hy = (and we know this). However, unlike the previous case we do not know about the JT-holes in 
advance. The idea is to run the algorithm from the previous special case but build the set of X-holes on the fly. By 
making sure that the traversal algorithm for a 1D-BST iterates through the leaves from the smallest to the largest 
value, we can assume (that in the absence of any X-holes), the algorithm iterates over all elements in S in order 
of the X values (for for each fixed x G X, it iterates through all the (x,y) in order of the y values). Further, by 



Lemma 4.3 in Oilogn) time, we can check whether an (x,y) is in an X-hole or not (and if so then compute the 
interval corresponding to the X-ho\e). Thus, by Remark |B.2| and the argument for the earlier special case, we can 
conclude that we can solve the case of hy — in time 0((hx + \0\) logmlogn). 

Next, we consider our final special case. Let us assume that we know that hy > and the Y-gaps are known 
up front- I2, . ■ ■ , Ihy+i- Let y min and y max denote the smallest and largest y values in T. (Note that we can 
compute these values in time O(logn).) Let L\, . . . ,Lh Y +i be the live intervals, i.e. these are the intervals in the 
set [y m in, !/max] \ U^Zg" 1 /;. Note that the final output is 

O = u^ +1 (i? xr)n(Sn Lj). 

For notational convenience define Sj d = S H Lj and Oj = (R X T) l~l (S fl Lj). Then note that O is the disjoint union 
of Ujl.^Oj. Further, Oj has no Y-holes. In other words, if we had access to a 2D-BST for every Sj, then using the 
algorithm for the special case above, we could have computed Oj in time 0((hx + \Oj\) logmlogn), which means 
we can solve the entire problem in time 0((hxhy + \0\) logmlogn). 

Thus, to finish the proof we need to show (i) how we can simulate a 2D-BST for Sj from the given 2D-BST for S 
and (ii) remove the assumption that we know the y-hol es (a nd hence the live intervals) beforehand. We begin with 



(i). We first note that given a vertex v in T , by Lemma 2.6 we can in O(logm) time compute a succinct description 



of which y values in T Y (v) still lie in Sj (and in particular, we know whether this number is non-zero or not). Thus, 
we can simulate a traversal on the 2D-BST for Sj by traversing the 2D-BST for S as before but not going down the 
subtree of any vertex v for which we know that there are no values in S whose y coordinate lies in Lj . Note that this 
would traverse only a constant factor mor e nod es than when we had a perfect 2D-BST for Sj and we would spend 



O(logm) factor more time (due to Lemma 2.6) in each node of T ■ Thus, in the worst case, applying this trick on 
the algorithm in the paragraph above will lead to an overall time of 0({(hx + l)(hy + 1) + \0\) log 2 mlogn), as 
desired. 

We finally argue how we get rid of the assumption of knowing the Y-holes beforehand. The trick is the same one 
that we used earlier for X-holes. We will compute the set of live intervals in an online fashion. In particular, we 
will have a set of candidate live intervals C. We initialize C = [y m in , 2/max] ■ We run the algorithm till C — and at 
any point of time we work with the "left-most" candidate live interval L in C. If at any point, we encounter a new 
new Y-hole during our traversal, we split L into (at most two) new live intervals L\ and L2 (where L\ is the left 
most among the two). We then continue the algorithm with L\ and put L 2 back into C. The only thing we have to 
careful about is not to repeat whatever part of the traversal we have done so far L for To avoid this, when we 
add L2 to £, we retain the state of the traversal so that when we work with L2 at a later point, we can just re-start 
the traversal for L 2 at the point when L got "split." To finish we make the simple observation we only need to store 
O(logm) amounts of information to maintain the state. 

Lemma E.l. It is enough to maintain a path of length O(logm) in the 2D-BST for S as the current state of 
traversal in the 2D-BST. It takes 0{\) time to compute the next node in the traversal (and to update the state). 
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E.2 Proof Sketch of Theorem HI 



The proof of Theorem |4.4| is pretty much the same as that of Theorem 42 so we just sketch where the proof differs. 
Note that every "live" interval L considered in Algorithm [4] belongs to Cy. Now note that Algorithm [4] basically goes 
through all points in S that lie in L but jumps over all the A-holes it encounters. (There are two remarks. First, 
Algorithm [4] does not know about all the A-holes in L n X up front. However, the algorithm "discovers" such an 
A-hole when it encounters the first element in the hole and hence, we can w.l.o.g. assume that the algorithm knows 
all the A-holes in L n X up front. Second, Algorithm [4] at some point might work with a live intervals that cover 
more than one interval in Cy. However, this only happens when the algorithm has not discovered the Y"-hole(s) 
that fall between these multiple intervals in Cy. Hence, we can again assume w.l.o.g. that the algorithm knows the 
intervals in Cy up front.) Thus, if Oi C O are the output points that fall in i G Cy, then the algorithm spends 
time (up to poly-log factor) \£ l~l X\ + \Og\ dealing with points in £. The discussion above then implies the claimed 
runtime in Theorem IL4I 



E.3 Comparison with Previous Algorithms: Details 

PROOF of Proposition 14.121 Given that h x — hy — 2 and that the output is empty, Theorem |4.2| implies 
that Algorithm [4] will takes time (3(log 3 n) on the instances in Example 4.1 On the other hand, let us consider our 



Algorithm [3] where assume that S is first indexed by X and then Y (a similar argument holds when we first sort 
by Y and then A). The first step in this algorithm will be to compute R D ttx(S), which note is same as R. In 
other words, just this first step will take time f2(n), which is exponentially worse than the runtime of Algorithm |2j 
Finally, we consider the NPRR algorithm. In this case the best fractional cover of the query hyper graph is to pick 
S to the extent of 1 and to pick R and S to the extent of 0. In other words, we go through each tuple (x, y) G S 
and check if x € R and y G T. Note that this will also take £l(N) time. □ 

Proof of Proposition 14.131 We begin with NPRR. We first note that NPRR on the bow-tie query runs in 
time <3(min(|i?| • |T|, \S\)). Since h x < \R\ and hy < \T\, Theorem [4~2| and Corollary [4~5| implies that Algorithm^ 
never runs slower then NPRR (up to poly-log factors). 

We now consider Algorithm [3] For notational convenience, we assume that the one index for S is first sorted by 
Y and then by X. (If not, one can suitably restate the run time of Algorithm [4] to make the rest of the argument 
go through.) Then Algorithm [3] first computes y' = Tnn Y (S). Note that by definition, \Y\ > \C Y \. Further, for 
each y G y' , we perform the set intersection Rn S(X,y). We group these intersection as Uy^LnyR^ S(X,y) for 
every L g Cy. We claim that the total time taken on these intersections is f2(|L n X\ + |0l|). This claim along with 
Theorem 4.4 then implies that Algorithm [4] is no worse (up to poly-log) factors than Algorithm[3j as desired. Finally, 



we argue the claim. The main observation is that the DLM algorithm on RD S(X,y) takes time (up to logarithmic 
factor) at least linear in the number of X-holes that cover some element in S(X, y) (as well as the number of pairs 
(x',y) G O). This is because every such A-holc exactly corresponds to a jump/gap in R in the DLM algorithm. 
Summing up over all y G L n y' implies the claim (note that the total number of relevant A-holes encountered is at 
least \L l~l X\). Thus, we have argued that □ 

E.4 Missing Lower Bound Proofs 



Proof of Lemma 14.81 To explain the main idea, let us first assume that algorithm A can only issue non-adaptive 
box queries, i.e. A cannot design the next box query based on the answers to the previous queries. 

We construct an instance of the problem that can "fool" A if it issues two few queries. Let n = \X\, to = \y\, and 
s = |<S|. We first define X and y, and specify S later: 

X = {(2i - 3/2, 2i - 1/2) | i G [n]} 
y = {(2j-3/2,2j-l/2) \j€[m]} 

In other words, X is the collection of n open intervals of length 1 centered at the odd integers 2% — 1, i G [n]; and 
y is the collection of m open intervals of length 1 centered at the odd integers 2j — 1, j £ [to]. Points in S, which 
we have yet to specify, will have non-negative integral coordinates. Hence, for the purposes of bounding the number 
of queries A makes, without loss of generality we can assume that every box query that A issues has the form 
{[ail, £2], [2/112/2]} where x\ , x% , y\ , y% G Z. (If a box query does not satisfy the property then we can simply replace 
it by a "rounded" query satisfying the property with identical answers.) 

A 2 x 2 gadget is a set of four locations where we will put a couple of points in S . In particular, for every pair of 
indices i, j where i G [n] and j G [to], define the 2x2 gadget to be the set of four locations: 

G h3 = {{2i - 1, 2j - 1), (2i - 1, 2j), (2i, 2j - 1), (2i, 2j)}. 
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Note that of the four points in Gij, only (2i, 2j) is not covered by X and y, and the other three are covered. Now, 
define 



G° itj = {(2t-l,2j),(2i,2j-l)} 
= {(2i-l,2j-l),(2i,2j)}. 

And, suppose 5 fl G,*j is either or G*j. Then, if a box query contains an even number of points from the 2x2 
gadget Gij, then the box query won't be able to distinguish whether S n G,j is G°j or Gj .. 

Now, let q be the total number of box queries that A issues to the data structure. Without loss of generality, 
assume 4g < ran and 8q + 2 < \S\; otherwise we are done. For any box query Q, the number of 2 x 2 gadgets that Q 
has an odd number of points in is at most 4 (at the 4 corners of the query). We choose points in S so that, for each 
2x2 gadget G of those (at most) four gadgets at the corners of Q, S n G = G°. Thus far we have used up at most 
8q points in S. Since 4q < ran there is still at least one gadget G that is not at any corner of the q box queries. We 
use the next two points in S to put in G so that S fl G is either G° or G 1 . The rest of the points in S can be put at 
points covered by X and y that are far away from the gadget. Whether S n G is G° or G 1 , the results of the box 
queries are identical and hence A cannot tell whether S is covered. 

Finally, when A can issue adaptive query, our strategy is the same: we put points in S at arbitrary q gadgets G 
such that S n G = G°. In the end, there must be at least one gadget G not covered by the q adaptive queries. At 
that point we can switch so that S H G — G 1 . □ 

Proof of Lemma 14.91 The returned point set of a box query to I? is a collection of subtrees of the original 
dyadic tree T>. Each subtree is itself a dyadic 2D-BSTs. Now, each time A visits a node of a subtree, we can 
unambiguously view the visit as another box query corresponding to points in that sub-tree. Hence, the situation 
is exactly the same as in Lemma |4.8[ except that we charge each node visit to be a box query. Hence, the bound in 
the number of box queries in Lemma |4.8| becomes a run-time bound for A in this case. □ 

Proof of Lemma 14.101 For convenience, for each i g [d], define = \Xi\, and 

^ = {(2* -3/2, 2* -1/2) | ji G H} 

For each tuple j = (j 1; • • • , jd) € [ni] x ■ • • x [rid], define a d-cube gadget 

Gj = {(xi, ■ ■ ■ t x d ) | Xi e {2ji - 1,2;/,}, Vi e [d]} . 

Note that there is exactly one point in Gj that is not covered by the (Xi)f =1 ; that is the point (2ji, 2j2, • • • , 2jd)- 

Finally, define G? to be the set of points (xi, ■ ■ ■ ,Xd) € Gj such that the number of indices i € [d] for which 
Xi — 2ji has different parity from d. And, define Gj to be the set of points (x\, • • • , xj) £ Gj such that the number 
of indices i G [d] for which x% — 2ji has the same parity as that of d. Formally, 

G? = {(xi, • • • , Xd) € Gj such that 

\{i G [d] : Xi = 2ji}\ -d = l (mod 2)} , 

and 



Gj = {(xi, ■ ■ ■ , Xd) € Gj such that 

|{*G[d]:a! i = 2j i }|-d = Q (mod 2)} 
Note that the sole point (2j lt • • • , 2j d ) that (Xi)f =1 covers belongs to Gj. 



Now, similar to the proof of Lemma 4.8 suppose for some j we have S fl Gj is either G? or Gj. We claim the 
following is true: if a d-dimensional box query Q = (Ji, . . . , Id) does not contain exactly one point from the gadget 
Gj , then from the answer to Q we cannot tell whether S n Gj is G? or Gj . 

To see this, suppose Q contains more than 2 points from Gj. This means there has to be at least one index i G [d] 
for which both 2ji — 1 and 2ji belong to 7^. Now, let K be the (non-empty) set of all indices i G [d] for which both 
2ji — 1 and 2ji belong to l. L . This means [d] — K is the set of indices i for which either 2ji — 1 or 2ji belong to Zj, 
but not both. From these observations, it is easy to see that \Q fl Gj | = \Q fl G? | = 2l K l~ 1 , proving the claim. 
For each box query Q, there are at most 2 d cube gadgets G that has exactly one point in Q. We will make sure 



that S n G = G° for those corner gadget. The rest of the proof proceeds as that of Lemma 4.8 □ 
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E.5 Proof of Lemma 14771 

We will need the following lemma from Jl7]. The statement of the lemma is slightly different from Lemma 11 



17 but this one follows immediately from the proof of Lemma 11 in |17| . 
Lemma E.2. Assume 3SUM required Q(n 2 / f(n)) expected time and let R be such that 

w(/(n • f(n))Vn) <R< o(n/(f(n ■ /(«)))). 

Then Q,{n 2 / f 2 (n • /(n))) expected time is needed to report 0{n 2 / R) triangles in the following tripartite graph (for 
any x that divides n): 

• the three parts are A, J5, C of sizes \A\ = Rx, \B\ = Rn/x and \C\ = n 

• each vertex in C has at most x neighbors in A and at most n/x neighbors in B 

• There are 0(nR) edges between A and B . 

We note that Lemma 11 in 17 is stated for x = ^/nf^] For the sake of completeness we sketch the part of the 



proof of Lemma E.2 that differs from the proof of Lemma 11 in 17 



Proof Proof Sketch. [IT] shows that there exists a near linear hash function h : U [R] such that each 
bucket has 0(n/R) elements mapped to it. Further, for any r £ [R], let h^ x (r) denote the pre-image of r under h. 



(Note that \h (r)| < 0(n/R).) The hardness result in 17 follows by solving the following problem 



Given 2nR triples (x, y, i), which represent the edges between A and B, check whether h 1 {y) intersects 
h~ l (x + y) — i or hr x {x + y + 1) — i. 

We now present the construction of the tripartite graph. First let A = [R] x [x] , B = [R] x [n/x] and C — [n]. For 
any (a, i) £ A interpret it as the set h~ l (a) — i and interpret any (b, j) £ B as the set + j ■ x. There is an 

edge from (a,i) to c iff c £ h^ 1 (a) — i and there is an edge from (b,j) to c £ C iff c £ + j ■ x. As mentioned 

above there are 2nR checks to be performed, each of the form: does h~ 1 (y) intersect h~ 1 (x + y) — si First, this 
intersection is encoded as an intersection between h~ 1 (y) + \_s/x\ ■ x and h _1 (x + y) — s mod x. The latter is then 
encoded as an edge between (y, [s/x\ ) £ B and (x + y, s mod x) £ A. 

The claims on sizes of A, B and C as well as the number of edges between A and B follows from the construction 
above. Next, we verify the claim on the degree of elements in C . It is easy to check that the construction above 
implies that each c £ C has an edge to (h(c + i),i) £ A for every < i < min(n — c, x) (which implies that c has at 
most x neighbors in A) and an edge to (h(c — j),j) £ B for every < j < min(c, n/x) (which implies that c has at 
most n/x neighbors in B). 



The rest of the proof is exactly the same in the proof of Lemma 11 in 17 . □ 



Next, we present the simple implication of solving a bow-tie query to listing triangles in a tripartite graph. 

Lemma E.3. Let G be a tripartite graph on tripartition [A, B, C) such that there are at most m edges between A 
and B, at most t triangles in G and every c £ C has at most deg j4 (c) neighbors in A and deg B (c) neighbors in B. 

Also assume that any bow-tie query such that all of the relations R, S and T have sizes at most N can be solved 
in time 0{(g(hx ,hy) + \0\) ■ log d N) for some constant d>l. 

Then all triangles in G can be listed in time 



O I (^5(deg A (c),deg B (c))+i + TO + ^(deg A (c) + deg B ( C )) J log d max(m, J . 



(2) 



Proof. Let S denote the set of edges between A and B. (Note that IS"! = m.) For every c £ C let R c (and T c ) 
denote the set of neighbors of c in A (B resp.). For every c £ C , define the bow-tie query: 

O c = J? c MSN T c . 

It is easy to check that the set of triangles in G is exactly the output of the query 



10 There is another technical difference: Lemma 1 1 in |l7| assumes the hardness of the related problem of Convolution-3SUM 
hardness of Q.(n 2 /f(n)). However, Theorem 10 in [17] shows that an f2(n 2 //(n)) hardness of 3SUM i mplie s a Sl(n 2 / f 2 (n- /(n))) 



hardness for Convolution-3SUM hardness, which m turn explains the f (n ■ f(n)) terms in Lemma E.2 
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To solve the above we solve the bow-tie query O c as follows. We first build a 2D-BST for S in time O(mlogm) and 
then sort the elements in R c (and T c ) in time 0(deg A (c) log(deg A (c))) < 0(deg A (c)\og(\A\)) (0(deg B (c) log(|_B|)) 
resp.). Solving the bow-tie queries for every c € C by assumption takes time 

0(g(deg A (c),deg B (c)) + \O c \)log d N, 

where define N = max(m, \A\, \B\). Thus, summing up all these times along with the facts that t — J2 c ec \®c\ an d 
that we only need to build the 2D-BST for S only once implies the claimed running time in pi). □ 

We now collect all the results above to prove Lemma [4. 7| 



Proof of Lemma 14.71 We apply Lemma E.3 to the hard instance from Lemma |E.2| Note that in this case we 
have for every c € C, hx = deg^(c) < x and hy = deg B (c) < n/x). Thus, we have that g{hx, hy) < n 1 ^ 6 . Further, 
we have t = Q(n 2 /R ) and m = 0(nR). Thus, if we pick R and max(x,n/x) < o(n/(poly(log n)f 2 (n ■ f(n)))), 

implies that the 3SUM problem can be solved in time 0(n 2-e / 2 ), which would contradict 



then Lemmas 



E.3 



and 



E.2 



Conjecture |4.6| as desired. □ 

F. ILLUSTRATION FOR ALGORITHM 4 

See Figure [6] for a snapshot of the run of Algorithm [4] on the bow-tie query from Figure [3j 

G. (NON) INSTANCE OPTIMALITY OF LEAPFROG- JOIN 



In this section, we will summarize how Leapfrog join algorithm from 19 works and how it is related to our adaptive 
join algorithm. Leapfrog join has two parts. The first part is about the algorithm to compute the intersection of k 
sorted sets. The second part is about using the intersection algorithm to compute the general join query 

First we will summarize their intersection algorithm. The basic idea is similar to the DLM adaptive intersection 
set. Specifically, as in DLM, the element is eliminated in low-to-high ordering fashion. At any point of time, the 
set of smallest uneliminated elements in all sets is maintained. Suppose that a\ is the smallest and a 2 is the largest 
element in that set; also a x g Ai and a 2 £ A 2 . Then in the set A 1: the algorithm will gallop to find the smallest 
element a'i such that a[ > a 2 . It turns out that all elements between a\ and a' x (not including a' x ) are eliminated. 
And the smallest uneliminated element in A\ is now set to a! Y and the algorithm continues by the same fashion. 
In summary, this algorithm is nothing different from DLM algorithm; the only difference is that the average case 
is analyzed in DLM by introducing the notion of gap to obtain the better running time (by galloping in both sides 
instead of just from low to high) 

Second part the Leapfrog join algorithm to compute the general join query is introduced. The algorithm works 
as follows. First we need to have a global ordering of all attributes. Then for every relation, it will be sorted in a 
single index in which their attributes' order is consistent to the global order. To make it simple, we will describe 
how it works with triangle query R(A, B) XI S(B, C) X T(A, C). Here the global order is chosen to be A, B, and C 
and so R is sorted by A, then by B. Similar case is applied for both S and T. To compute this triangle query, first 
they use the intersection set algorithm in the first part to compute tta(R) H wa(T). Then for every a in that set, 
the intersection set algorithm is used again to compute UA=a{R) H ttb(S). Then again, for every b in that set, we 
compute <JB=b{S) H &A=a(C)- Finally we have the output tuple (a,b,c) for every satisfied element a, b, and c. 

It is claimed that the above algorithm is optimal in the worst case, but within the log factor. It is fairly easy to 
see that the above algorithm is not instance optimal for acyclic queries. Consider Example |3.1| Recall that in the 
example, the join is R{X) M S(X, Y) M S 2 (Y, Z) X T(Z), where |5 2 | = |T| = 1 and S 2 X T = 0. Then our algorithm 
can discover that the result join will be empty within the constant time (up to log factor) by running the elimination 
process between S 2 and T. But the Leapfrog join algorithm may take longer to discover that the result join is empty 
because their steps are as follows: first the join over attribute X is computed, then for every x, compute y, and 
finally for every y, compute z. Hence it would take Q (TV) time, which is not instance optimal. 
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Figure 6: Run of Algorithm [4] on the bow-tie query from Figure [3} We are at a stage when the X-hole 
corresponding to x = 1 and the y-hole corresponding to y = 1 have been discovered. The discovered 
X-hole is denoted by the light blue box while he y-hole is denoted by the pink box. (Ignore the 
yellow boxes for now.) At this point the live interval is [2,3]. The algorithm will first check the BST 
corresponding to the root of the X-BST it'll first verify that this is not a node that is contain inside 
the (blue) X-hole and then by a range query in the y-BST corresponding to the root, it thinks that 
there are 8 — 3 = 5 > live points. It will then go to the node corresponding to the x-range [1,2]. 
Here again it'll first verify that it is not contained in the existing X-hole and then it'll think that 
this sub-tree has 5 — 2 > live points. It will then go to the node corresponding to the interval [1]. 
Here it will realize that it has hit the existing (blue) X-hole and will then backtrack to the node 
with interval [2]. It will then verify that it does not fall in the existing X-hole and will verify that 
the corresponding y-BST has 2 — 1 > nodes. It will then traverse the y-BST, first hitting the pink 
y-hole and then moving on to the other leaf in the y-BST, which corresponds to the point (2, 3). The 
algorithm will first verify that this does not lie in an X-hole and then will find out that it lies in the 
y corresponding to y = 3 and hence, will discover the y-hole marked yellow. The live interval then 
shrinks to [2] and the algorithm continues. 
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